Calculating optimal policies is known to be computationally difficult for Markov decision processes (MDPs)
with Borel state and action spaces. This paper studies finite-state approximations of discrete time Markov
decision processes with Borel state and action spaces, for both discounted and average cost criteria. The
stationary policies thus obtained are shown to approximate the optimal stationary policy with arbitrary 
precision under quite general conditions for discounted cost and more restrictive conditions for average cost. 
For compact-state MDPs, we obtain explicit rate of convergence bounds quantifying how the approximation 
improves as the size of the approximating finite state space increases. Using information theoretic arguments, 
the order optimality of the obtained convergence rates is established for a large class of problems. We also 
show that, as a pre-processing step, the action space can also be finitely approximated with a sufficiently large
number of points; thereby, well known algorithms, such as value or policy iteration, Q-learning, etc., can be
used to calculate near optimal policies.
1. Introduction. In this paper, our goal is to study the finite-state approximation problem
for computing near optimal policies for discrete time Markov decision processes (MDPs) with Borel
state and action spaces, under discounted and average cost criteria. Although the existence and
structural properties of optimal policies have been studied extensively in the literature, computing 
such policies is generally a challenging problem for systems with uncountable state spaces. This 
situation also arises in the fully observed reduction of a partially observed Markov decision process 
even when the original system has finite state and action spaces (see, e.g., Yu and Bertsekas [45]). 

As has been extensively studied in the literature (see, e.g., Chow and Tsitsiklis [11] and the 
literature review below), one way to compute approximately optimal solutions for such MDPs is 
to construct a reduced model with a new transition probability and a one-stage cost function by 
quantizing the state/action spaces, i.e., by discretizing them on a finite grid. We show that under
quite general continuity conditions on the one-stage cost function and the transition probability 
for the discounted cost and under some additional restrictions on the ergodicity properties of 
Markov chains induced by deterministic stationary policies for the average cost, the optimal policy 
for the approximating finite model applied to the original model has cost that converges to the 
optimal cost, as the discretization becomes finer. Moreover, under additional continuity conditions 
on the transition probability and the one-stage cost function we also obtain bounds for a rate of
approximation in terms of the number of points used to discretize the state space, thereby providing
a tradeoff between the computation cost and the performance loss in the system. In particular, we 
study the following two problems. 

(Q1) Under what conditions on the components of the MDP do the true costs corresponding
to the optimal policies obtained from finite models converge to the optimal value function as the 
number of grid points goes to infinity? For this problem, we are only concerned with the convergence 
of the approximation; that is, we do not establish bounds for a rate of approximation. 

(Q2) Can we obtain explicit bounds on the performance loss due to the discretization in terms 
of the number of grid points if we strengthen the conditions sufficient in (Q1)?

Combined with our recent works Saldi et al. [33, 34], where we investigated the asymptotic optimality
of the quantization of action sets, the results in this paper lead to a constructive algorithm
for obtaining approximately optimal solutions. First the action space is quantized with small error,
and then the state space is quantized with small error, which results in a finite model that
well approximates the original MDP. When the state space is compact, we also obtain rates of 
convergence for both approximations, and using information theoretic tools we establish that the 
obtained rates of convergence are order-optimal for a given class of MDPs. Since there exist various 
computational algorithms for finite-state Markov decision problems, the analysis in this paper can 
be considered to be constructive. 

Various methods have been developed to compute approximate value functions and near optimal 
policies. A partial list of these techniques is as follows: approximate dynamic programming,
approximate value or policy iteration, simulation-based techniques, neuro-dynamic programming (or
reinforcement learning), state aggregation, etc. For rather complete surveys of these techniques,
we refer the reader to Fox [17], Whitt [42, 43], Langen [28], Bertsekas and Tsitsiklis [6], Ren and
Krogh [32], Ortner [30], White [40, 41], Bertsekas [4], Dufour and Prieto-Rumeau [14, 15] and
references therein. With the exception of Dufour and Prieto-Rumeau [15], Ortner [30], these papers
in general study either the finite horizon cost or the discounted infinite horizon cost. Also, the 
majority of these results are for MDPs with discrete (i.e., finite or countable) state and action 
spaces, or a bounded one-stage cost function (e.g., Fox [17], Whitt [42, 43], Van Roy [37], White 
[40, 41], Cavazos-Cadena [9], Bertsekas and Tsitsiklis [6], Ren and Krogh [32], Ortner [30],
Bertsekas [4]). Those that consider general state and action spaces (see, e.g., Dufour and Prieto-Rumeau
[13, 14, 15], Bertsekas [4], Chow and Tsitsiklis [11]) assume in general Lipschitz type continuity 
conditions on the components of the control model, in order to provide a rate of convergence 
analysis for the approximation error. Some of the results only consider approximating the value 
function and do not provide a procedure to compute near optimal policies (e.g., Langen [28], Whitt 
[43], Dufour and Prieto-Rumeau [14]). 

Our paper differs from these results in the following ways: (i) we consider a general setup, where 
the state and action spaces are Borel (with the action space being compact), and the one-stage cost 
function is possibly unbounded, (ii) since we do not aim to provide a rate of convergence result in the
first problem (Q1), the continuity assumptions we impose on the components of the control model
are weaker than the conditions imposed in prior works that considered general state and action
spaces, (iii) we also consider the challenging average cost criterion under reasonable assumptions.
The price we pay for imposing weaker assumptions in (Q1) is that we do not obtain explicit
performance bounds in terms of the number of grid points used in the approximations. However, 
such bounds can be obtained under further assumptions on the transition probability and the 
one-stage cost functions; this is considered in problem (Q2) for compact-state MDPs. 

Our approach to solve problem (Q1) can be summarized as follows: (i) first, we obtain
approximation results for the compact-state case, (ii) we find conditions under which a compact
representation leads to near optimality for non-compact state MDPs, (iii) we prove the convergence
of the finite-state models to non-compact models. As a by-product of this analysis, we obtain
compact-state-space approximations for an MDP with non-compact Borel state space. In particular,
our findings directly lead to finite models if the state space is countable; similar problems in the
countable context have been studied in the literature for the discounted cost; see Puterman [31,
Section 6.10.2].

We note that the proposed method for solving the approximation problem for compact-state 
MDPs with the discounted cost is partly inspired by Van Roy [37]. Specifically, we generalize the 
operator proposed for an approximate value iteration algorithm in Van Roy [37] to uncountable 
state spaces. Then, unlike in Van Roy [37], we use this operator as a transition step between the 
original optimality operator and the optimality operator of the approximate model. In Ortner [30], 
a similar construction was given for finite state-action MDPs. Our method to obtain finite-state 
MDPs from the compact-state model can be regarded as a generalization of this construction. We 
note that a related work of Dufour and Prieto-Rumeau [15] develops a sequence of approximations
using empirical distributions of an underlying probability measure with respect to which the
transition probability of the MDP is absolutely continuous. By imposing Lipschitz type continuity
conditions on the components of the control model, Dufour and Prieto-Rumeau [15] obtain
a concentration inequality type upper bound on the accuracy of the approximation based on the
Wasserstein distance of order 1 between the probability measure and its empirical estimate. These
conditions are stronger than what we impose for the problem (Q1). We note that Dufour and
Prieto-Rumeau [15] adopt a simulation based approximation leading to probabilistic guarantees
on the approximation, whereas we adopt a quantization based approach leading to deterministic 
approximation guarantees. For a review of further simulation based methods, see, e.g., Chang et al.
[10], Jain and Varaiya [25]. 

The approach developed in the paper is also useful in networked control applications where 
transmission of real-valued actions to an actuator is not realistic when there is an information 
transmission constraint between a plant, a controller, and an actuator (see, e.g., Yüksel and Başar
[46]). On the other hand, the elements of a finite action set can be transmitted across a finite
capacity information channel. Even though the problem of optimal quantization for information
transmission from a plant/sensor to a controller has been studied extensively (see, e.g., references
in Yüksel and Başar [46]), these types of results appear to be new in the networked control
literature when the problem of transmitting signals from a controller to an actuator is considered.
Furthermore, tools from information theory allow for obtaining lower bounds on the approximation 
performance; using such an argument we show that the construction in this paper is order-optimal 
for a large class of models. 

The rest of the paper is organized as follows. In Section 2 we study the approximation problem 
(Q1) for MDPs with compact state space. In Section 3 an analogous approximation result is
obtained for MDPs with non-compact state space. Discretization of the action space is considered in 
Section 4 for a general state space. In Section 5 we derive quantitative bounds on the approximation 
error in terms of the number of points used to discretize the state space for the compact-state case. 
In Section 6 the order optimality of the obtained bounds on the approximation errors is established. 
In Section 7 we present an example to numerically illustrate our results. Section 8 concludes the 
paper. 


1.1. Notation and Conventions. For a metric space $E$, the Borel $\sigma$-algebra (the smallest
$\sigma$-algebra that contains the open sets of $E$) is denoted by $\mathcal{B}(E)$. We let $B(E)$ and $C_b(E)$ denote
the sets of all bounded Borel measurable and bounded continuous real functions on $E$, respectively. For any
$u \in C_b(E)$ or $u \in B(E)$, let $\|u\| := \sup_{e \in E} |u(e)|$, which turns $C_b(E)$ and $B(E)$ into Banach spaces.
Given any Borel measurable function $w : E \to [1, \infty)$ and any real valued Borel measurable function
$u$ on $E$, we define the $w$-norm of $u$ as

\[ \|u\|_w := \sup_{e \in E} \frac{|u(e)|}{w(e)}, \]

and let $B_w(E)$ denote the Banach space of all real valued measurable functions $u$ on $E$ with finite
$w$-norm; see Hernandez-Lerma and Lasserre [22]. Let $\mathcal{P}(E)$ denote the set of all probability
measures on $E$. A sequence $\{\mu_n\}$ of probability measures on $E$ is said to converge weakly (resp.,
setwise) (see Hernandez-Lerma and Lasserre [23]) to a probability measure $\mu$ if
$\int_E g(e)\,\mu_n(de) \to \int_E g(e)\,\mu(de)$ for all $g \in C_b(E)$ (resp., for all $g \in B(E)$). For any
$\mu, \nu \in \mathcal{P}(E)$, the total variation distance between $\mu$ and $\nu$, denoted as $\|\mu - \nu\|_{TV}$, is equivalently defined as

\[ \|\mu - \nu\|_{TV} := 2 \sup_{D \in \mathcal{B}(E)} |\mu(D) - \nu(D)| = \sup_{\|g\| \le 1} \left| \int_E g(e)\,\mu(de) - \int_E g(e)\,\nu(de) \right|. \]

Unless otherwise specified, the term 'measurable' will refer to Borel measurability in the rest of
the paper.
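
The equivalence of the two expressions for the total variation distance can be checked by hand on a finite space. The sketch below is only an illustration under the assumption that $E$ has three points, so that measures are plain probability vectors; the function names `tv_by_sets` and `tv_by_functions` are hypothetical and not part of the paper.

```python
from itertools import combinations


def tv_by_sets(mu, nu):
    """2 * sup over all events D of |mu(D) - nu(D)| on a finite space (brute force)."""
    points = range(len(mu))
    best = 0.0
    for size in range(len(mu) + 1):
        for event in combinations(points, size):
            gap = abs(sum(mu[i] for i in event) - sum(nu[i] for i in event))
            best = max(best, gap)
    return 2.0 * best


def tv_by_functions(mu, nu):
    """sup over ||g|| <= 1 of |sum g*mu - sum g*nu|; the sup is attained at g = sign(mu - nu)."""
    return sum(abs(m - n) for m, n in zip(mu, nu))


# Toy probability vectors (assumed for illustration only).
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
print(tv_by_sets(mu, nu), tv_by_functions(mu, nu))
```

On a finite space both expressions reduce to the $\ell_1$ distance between the probability vectors, which is why the brute-force search over events and the closed form agree up to rounding.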

1.2. Markov Decision Processes. A discrete-time Markov decision process (MDP) can be
described by a five-tuple

\[ \bigl( X, A, \{A(x) : x \in X\}, p, c \bigr), \]

where Borel spaces (i.e., Borel subsets of complete and separable metric spaces) $X$ and $A$ denote the
state and action spaces, respectively. The collection $\{A(x) : x \in X\}$ is a family of nonempty subsets
$A(x)$ of $A$, which give the admissible actions for the state $x \in X$. The stochastic kernel $p(\,\cdot\,|x,a)$
denotes the transition probability of the next state given that the previous state-action pair is $(x,a)$;
see Hernandez-Lerma and Lasserre [21]. Hence, it satisfies: (i) $p(\,\cdot\,|x,a)$ is an element of $\mathcal{P}(X)$ for
all $(x,a)$, and (ii) $p(D|\,\cdot\,,\,\cdot\,)$ is a measurable function from $X \times A$ to $[0,1]$ for each $D \in \mathcal{B}(X)$. The
one-stage cost function $c$ is a measurable function from $X \times A$ to $\mathbb{R}$. In this paper, it is assumed
that $A(x) = A$ for all $x \in X$.

Define the history spaces $H_0 = X$ and $H_t = (X \times A)^t \times X$, $t = 1, 2, \ldots$, endowed with their product
Borel $\sigma$-algebras generated by $\mathcal{B}(X)$ and $\mathcal{B}(A)$. A policy is a sequence $\pi = \{\pi_t\}$ of stochastic kernels
on $A$ given $H_t$. The set of all policies is denoted by $\Pi$. Let $\Phi$ denote the set of stochastic kernels $\varphi$ on
$A$ given $X$, and let $\mathbb{F}$ denote the set of all measurable functions $f$ from $X$ to $A$. A randomized Markov
policy is a sequence $\pi = \{\pi_t\}$ of stochastic kernels on $A$ given $X$. A deterministic Markov policy is
a sequence of stochastic kernels $\pi = \{\pi_t\}$ on $A$ given $X$ such that $\pi_t(\,\cdot\,|x) = \delta_{f_t(x)}(\,\cdot\,)$ for some $f_t \in \mathbb{F}$,
where $\delta_z$ denotes the point mass at $z$. The sets of randomized and deterministic Markov policies
are denoted by $RM$ and $M$, respectively. A randomized stationary policy is a constant sequence
$\pi = \{\pi_t\}$ of stochastic kernels on $A$ given $X$ such that $\pi_t(\,\cdot\,|x) = \varphi(\,\cdot\,|x)$ for all $t$ for some $\varphi \in \Phi$.
A deterministic stationary policy is a constant sequence of stochastic kernels $\pi = \{\pi_t\}$ on $A$ given
$X$ such that $\pi_t(\,\cdot\,|x) = \delta_{f(x)}(\,\cdot\,)$ for all $t$ for some $f \in \mathbb{F}$. The sets of randomized and deterministic
stationary policies are identified with the sets $\Phi$ and $\mathbb{F}$, respectively.

According to the Ionescu Tulcea theorem (see Hernandez-Lerma and Lasserre [21]), an initial
distribution $\mu$ on $X$ and a policy $\pi$ define a unique probability measure $P_\mu^\pi$ on $H_\infty = (X \times A)^\infty$. The
expectation with respect to $P_\mu^\pi$ is denoted by $E_\mu^\pi$. If $\mu = \delta_x$, we write $P_x^\pi$ and $E_x^\pi$ instead of $P_{\delta_x}^\pi$ and
$E_{\delta_x}^\pi$. The cost functions to be minimized in this paper are the $\beta$-discounted cost and the average
cost, respectively given by

\[ J(\pi, x) = E_x^\pi \Bigl[ \sum_{t=0}^{\infty} \beta^t c(x_t, a_t) \Bigr], \]

\[ V(\pi, x) = \limsup_{T \to \infty} \frac{1}{T} E_x^\pi \Bigl[ \sum_{t=0}^{T-1} c(x_t, a_t) \Bigr]. \]
With this notation, the discounted and average value functions of the control problem are defined
as

\[ J^*(x) := \inf_{\pi \in \Pi} J(\pi, x), \]

\[ V^*(x) := \inf_{\pi \in \Pi} V(\pi, x). \]

A policy $\pi^*$ is said to be optimal if $J(\pi^*, x) = J^*(x)$ (or $V(\pi^*, x) = V^*(x)$ for the average cost) for
all $x \in X$. Under fairly mild conditions, the set $\mathbb{F}$ of deterministic stationary policies contains an
optimal policy for discounted cost (see, e.g., Hernandez-Lerma and Lasserre [21], Feinberg et al.
[16]) and average cost optimal control problems (under somewhat stronger continuity/recurrence
conditions, see, e.g., Feinberg et al. [16]).

Remark 1.1. We note that the path-wise infinite sum $\sum_{t=0}^{\infty} \beta^t c(x_t, a_t)$ may not be well-defined
in the definition of $J$ if $c$ is only assumed to be measurable. However, further assumptions that
will be imposed in later sections ensure that $J$ is a well-defined function.

1.3. Auxiliary Results. To avoid measurability problems associated with the operators that
will be defined for the approximation problem in the discounted cost case, it is necessary to enlarge 
the set of functions on which these operators can act. To this end, in this section we review the 
notion of analytic sets and lower semi-analytic functions, and state the main results that will be 
used in the sequel to tackle these measurability problems. For a detailed treatment of analytic sets 
and lower semi-analytic functions, we refer the reader to Shreve and Bertsekas [36], Blackwell et al. 
[7], Kuratowski [27, Chapter 39], and Bertsekas and Shreve [3, Chapter 7]. 

Let $\mathbb{N}^\infty$ be the set of sequences of natural numbers endowed with the product topology. With
this topology, $\mathbb{N}^\infty$ is a complete and separable metric space. A subset $A$ of a Borel space $E$ is said
to be analytic if it is a continuous image of $\mathbb{N}^\infty$. Note that Borel sets are always analytic.

A function $g : E \to \mathbb{R}$ is said to be universally measurable if for any $\mu \in \mathcal{P}(E)$, there is a Borel
measurable function $g_\mu : E \to \mathbb{R}$ such that $g = g_\mu$ $\mu$-almost everywhere. It is said to be lower
semi-analytic if the set $\{e : g(e) < c\}$ is analytic for any $c \in \mathbb{R}$. Any Borel measurable function is lower
semi-analytic and any lower semi-analytic function is universally measurable. The latter property
implies that the integral of any lower semi-analytic function with respect to any probability measure
is well defined. We let $B^l(E)$ and $B^l_w(E)$ denote the set of all bounded lower semi-analytic functions
and lower semi-analytic functions with finite $w$-norm, respectively. Since any pointwise limit of a
sequence of lower semi-analytic functions is lower semi-analytic (see Kuratowski [27, Theorem 1,
p. 512]), $(B^l(E), \|\cdot\|)$ and $(B^l_w(E), \|\cdot\|_w)$ are Banach spaces.

We now state the results that will be used in the sequel. 

Proposition 1.1. (Bertsekas and Shreve [3, Proposition 7.47, p. 179]) Suppose $E_1$ and $E_2$ are
Borel spaces. Let $g : E_1 \times E_2 \to \mathbb{R}$ be lower semi-analytic. Then, $g^*(e_1) := \inf_{e_2 \in E_2} g(e_1, e_2)$ is also
lower semi-analytic.

Proposition 1.2. (Bertsekas and Shreve [3, Proposition 7.48, p. 180]) Suppose $E_1$ and $E_2$ are as
in Proposition 1.1. Let $g : E_1 \times E_2 \to \mathbb{R}$ be lower semi-analytic and $q(de_2|e_1)$ be a stochastic kernel
on $E_2$ given $E_1$. Then, the function

\[ e_1 \mapsto \int_{E_2} g(e_1, e_2)\, q(de_2|e_1) \]


is lower semi-analytic. 
2. Finite State Approximations of MDPs with Compact State Space. In this section
we consider (Q1) for the MDPs with compact state space. To distinguish compact-state MDPs
from non-compact ones, the state space of the compact-state MDPs will be denoted by $Z$ instead of
$X$. We impose the assumptions below on the components of the Markov decision process; additional
new assumptions will be made for the average cost problem in Section 2.2.

Assumption 2.1.

(a) The one-stage cost function $c$ is in $C_b(Z \times A)$.

(b) The stochastic kernel $p(\,\cdot\,|z,a)$ is weakly continuous in $(z,a)$, i.e., for all $z$ and $a$,
$p(\,\cdot\,|z_k,a_k) \to p(\,\cdot\,|z,a)$ weakly when $(z_k,a_k) \to (z,a)$.

(c) $Z$ and $A$ are compact.

Before proceeding with the main results, we first describe the procedure used to obtain finite-state
models. Let $d_Z$ denote the metric on $Z$. Since the state space $Z$ is assumed to be compact and
thus totally bounded, one can find a sequence $\bigl(\{z_{n,i}\}_{i=1}^{k_n}\bigr)_{n \ge 1}$ of finite grids in $Z$ such that for all $n$,

\[ \min_{i \in \{1,\ldots,k_n\}} d_Z(z, z_{n,i}) < 1/n \quad \text{for all } z \in Z. \]

The finite grid $\{z_{n,i}\}_{i=1}^{k_n}$ is called a $1/n$-net in $Z$. Let $Z_n := \{z_{n,1}, \ldots, z_{n,k_n}\}$ and define the function $Q_n$
mapping $Z$ to $Z_n$ by

\[ Q_n(z) := \arg\min_{z_{n,i} \in Z_n} d_Z(z, z_{n,i}), \]

where ties are broken so that $Q_n$ is measurable. In the literature, $Q_n$ is often called a nearest
neighbor quantizer with respect to the distortion measure $d_Z$; see Gray and Neuhoff [19]. For each
$n$, $Q_n$ induces a partition $\{S_{n,i}\}_{i=1}^{k_n}$ of the state space $Z$ given by

\[ S_{n,i} = \{z \in Z : Q_n(z) = z_{n,i}\}, \]

with diameter $\mathrm{diam}(S_{n,i}) := \sup_{z,y \in S_{n,i}} d_Z(z, y) < 2/n$. Let $\{\nu_n\}$ be a sequence of probability
measures on $Z$ satisfying

\[ \nu_n(S_{n,i}) > 0 \quad \text{for all } i, n. \tag{2.1} \]

We let $\nu_{n,i}$ be the restriction of $\nu_n$ to $S_{n,i}$ defined by

\[ \nu_{n,i}(\,\cdot\,) := \frac{\nu_n(\,\cdot\,)}{\nu_n(S_{n,i})}. \]

The measures $\nu_{n,i}$ will be used to define a sequence of finite-state MDPs, denoted as MDP$_n$ ($n \ge 1$),
to approximate the original model. To this end, for each $n$ define the one-stage cost function
$c_n : Z_n \times A \to \mathbb{R}$ and the transition probability $p_n$ on $Z_n$ given $Z_n \times A$ by

\[ c_n(z_{n,i}, a) := \int_{S_{n,i}} c(z,a)\, \nu_{n,i}(dz), \]

\[ p_n(\,\cdot\,|z_{n,i}, a) := \int_{S_{n,i}} Q_n * p(\,\cdot\,|z,a)\, \nu_{n,i}(dz), \]

where $Q_n * p(\,\cdot\,|z,a) \in \mathcal{P}(Z_n)$ is the pushforward of the measure $p(\,\cdot\,|z,a)$ with respect to $Q_n$; that
is,

\[ Q_n * p(z_{n,j}|z,a) = p(S_{n,j}|z,a), \]

for all $z_{n,j} \in Z_n$. For each $n$, we define MDP$_n$ as a Markov decision process with the following
components: $Z_n$ is the state space, $A$ is the action space, $p_n$ is the transition probability and $c_n$ is
the one-stage cost function. History spaces, policies and cost functions are defined in a similar way
as in the original model.
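
To make the construction of MDP$_n$ concrete, the following is a minimal sketch for a toy model that is not taken from the paper: we assume $Z = [0,1]$ with the usual metric, a finite action list, $\nu_n$ uniform on $[0,1]$ (so each $\nu_{n,i}$ is uniform on its cell), a cost `cost(z, a)`, and a kernel under which the next state is uniform on $[s, s + 0.5]$ with $s = (z + a)/4$. The integrals over $S_{n,i}$ are approximated by a midpoint rule, which is a numerical shortcut rather than part of the definition. All function names are hypothetical.

```python
def cell_prob(lo, hi, z, a):
    """p([lo, hi] | z, a) for the assumed toy kernel: next state ~ Uniform[s, s + 0.5], s = (z + a) / 4."""
    start = 0.25 * (z + a)
    overlap = min(hi, start + 0.5) - max(lo, start)
    return max(overlap, 0.0) / 0.5


def cost(z, a):
    """Assumed bounded continuous one-stage cost on [0, 1] x A."""
    return (z - 0.5) ** 2 + 0.1 * a ** 2


def build_finite_model(k, actions, samples=50):
    """Return (grid, c_n, p_n) for k equal cells; cell i is [i/k, (i+1)/k) with center z_{n,i}."""
    edges = [i / k for i in range(k + 1)]
    grid = [(i + 0.5) / k for i in range(k)]  # nearest-neighbor cells of these centers
    c_n = [[0.0] * len(actions) for _ in range(k)]
    p_n = [[[0.0] * k for _ in actions] for _ in range(k)]
    for i in range(k):
        # Midpoint-rule nodes inside S_{n,i}; averaging over them approximates
        # integration against the uniform measure nu_{n,i}.
        nodes = [edges[i] + (m + 0.5) / (samples * k) for m in range(samples)]
        for ai, a in enumerate(actions):
            c_n[i][ai] = sum(cost(z, a) for z in nodes) / samples
            for j in range(k):
                # Pushforward Q_n * p(z_{n,j} | z, a) = p(S_{n,j} | z, a), averaged over the cell.
                p_n[i][ai][j] = sum(
                    cell_prob(edges[j], edges[j + 1], z, a) for z in nodes
                ) / samples
    return grid, c_n, p_n


grid, c_n, p_n = build_finite_model(k=4, actions=[0.0, 0.5, 1.0])
print(grid)
print(p_n[0][0])
```

With $k$ equal cells every point of $[0,1]$ lies within $1/(2k)$ of a cell center, so choosing $k \ge n$ yields a $1/n$-net in this toy setting. Each row `p_n[i][ai]` sums to one because the cells partition $[0,1]$.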
2.1. Discounted Cost. Here we consider (Q1) for the discounted cost criterion with a discount
factor $\beta \in (0,1)$. Throughout this section, it is assumed that Assumption 2.1 holds.

Define the operator $T$ on $B(Z)$ by

\[ Tu(z) := \min_{a \in A} \Bigl[ c(z,a) + \beta \int_Z u(y)\, p(dy|z,a) \Bigr]. \tag{2.2} \]

In the literature $T$ is called the Bellman optimality operator. It can be proved that under
Assumption 2.1-(a)(b), $T$ is a contraction operator with modulus $\beta$ mapping $C_b(Z)$ into itself (see
Hernandez-Lerma [20, Theorem 2.8, p. 23]); that is, $Tu \in C_b(Z)$ for all $u \in C_b(Z)$ and

\[ \|Tu - Tv\| \le \beta \|u - v\| \quad \text{for all } u, v \in C_b(Z). \]

The following theorem is a widely known result in the theory of Markov decision processes (see again
Hernandez-Lerma [20, Theorem 2.8, p. 23]) which also holds without a compactness assumption
on the state space.

Theorem 2.1. The value function $J^*$ is the unique fixed point in $C_b(Z)$ of the contraction
operator $T$, i.e.,

\[ J^* = T J^*. \]

Furthermore, a deterministic stationary policy $f^*$ is optimal if and only if it satisfies the optimality
equation, i.e.,

\[ J^*(z) = c(z, f^*(z)) + \beta \int_Z J^*(y)\, p(dy|z, f^*(z)). \tag{2.3} \]

Finally, there exists a deterministic stationary policy $f^*$ which is optimal, so it satisfies (2.3).
Define, for all $n \ge 1$, the operator $T_n$, which is the Bellman optimality operator for MDP$_n$, by

\[ T_n u(z_{n,i}) := \min_{a \in A} \Bigl[ c_n(z_{n,i}, a) + \beta \sum_{j=1}^{k_n} u(z_{n,j})\, p_n(z_{n,j}|z_{n,i}, a) \Bigr], \]

or equivalently,

\[ T_n u(z_{n,i}) = \min_{a \in A} \int_{S_{n,i}} \Bigl[ c(z,a) + \beta \int_Z \hat{u}(y)\, p(dy|z,a) \Bigr] \nu_{n,i}(dz), \]

where $u : Z_n \to \mathbb{R}$ and $\hat{u}$ is the piecewise constant extension of $u$ to $Z$ given by $\hat{u}(z) = u \circ Q_n(z)$. For
each $n$, under Assumption 2.1, Hernandez-Lerma [20, Theorem 2.8, p. 23] implies the following: (i)
$T_n$ is a contraction operator with modulus $\beta$ mapping $B(Z_n)$ ($= C_b(Z_n)$) into itself, (ii) the fixed
point of $T_n$ is the value function $J_n^*$ of MDP$_n$, and (iii) there exists an optimal stationary policy
$f_n^*$ for MDP$_n$, which therefore satisfies the optimality equation. Hence, we have

\[ J_n^* = T_n J_n^* = T_n J_n(f_n^*, \,\cdot\,) = J_n(f_n^*, \,\cdot\,), \]

where $J_n$ denotes the discounted cost for MDP$_n$. Let us extend the optimal policy $f_n^*$ for MDP$_n$
to $Z$ by letting $f_n(z) = f_n^* \circ Q_n(z) \in \mathbb{F}$.

The following theorem is the main result of this section. It states that the cost function of the
policy $f_n$ converges to the value function $J^*$ as $n \to \infty$.
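
Since the Bellman operator of the finite model is a $\beta$-contraction, its fixed point can be computed by plain value iteration, and the resulting policy can be extended to $Z$ through the quantizer. The sketch below illustrates this on a hypothetical two-state, two-action model whose numbers are chosen only for illustration; a finite action list and a one-dimensional grid are assumptions, and the helper names are not from the paper.

```python
def q_value(u, c, p, beta, i, a):
    """c_n(z_i, a) + beta * sum_j u(z_j) p_n(z_j | z_i, a)."""
    return c[i][a] + beta * sum(uj * pj for uj, pj in zip(u, p[i][a]))


def bellman(u, c, p, beta):
    """One application of the finite-model Bellman optimality operator."""
    return [min(q_value(u, c, p, beta, i, a) for a in range(len(c[i])))
            for i in range(len(c))]


def value_iteration(c, p, beta, tol=1e-10):
    """Iterate the operator from u = 0 until successive iterates differ by less than tol."""
    u = [0.0] * len(c)
    while True:
        v = bellman(u, c, p, beta)
        if max(abs(x - y) for x, y in zip(u, v)) < tol:
            return v
        u = v


def greedy_policy(u, c, p, beta):
    """Action index attaining the minimum in the Bellman operator at each grid point."""
    return [min(range(len(c[i])), key=lambda a: q_value(u, c, p, beta, i, a))
            for i in range(len(c))]


def extend_policy(policy, grid):
    """Extension of the finite-model policy: look up the action at the nearest grid point."""
    def f_n(z):
        i = min(range(len(grid)), key=lambda j: abs(z - grid[j]))
        return policy[i]
    return f_n


# Hypothetical finite model: c[i][a] and p[i][a][j].
c = [[1.0, 1.5], [0.0, 0.5]]
p = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
beta = 0.5

J = value_iteration(c, p, beta)
policy = greedy_policy(J, c, p, beta)
f_n = extend_policy(policy, grid=[0.25, 0.75])
print(J, policy, f_n(0.1), f_n(0.9))
```

In this toy model the second state is free to stay in, so its value is zero, and the first state pays 1.5 once to move there rather than paying 1 forever. The stopping rule relies only on the contraction property; policy iteration or any other finite-state solver could be substituted.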
Theorem 2.2. The discounted cost of the policy $f_n$, obtained by extending the optimal policy
$f_n^*$ of MDP$_n$ to $Z$, converges to the optimal value function $J^*$ of the original MDP:

\[ \lim_{n \to \infty} \|J(f_n, \,\cdot\,) - J^*\| = 0. \]

Hence, to find a near optimal policy for the original MDP, it is sufficient to compute the optimal
policy of MDP$_n$ for sufficiently large $n$, and then extend this policy to the original state space.

To prove Theorem 2.2 we need a series of technical results. We first define an operator $\hat{T}_n$ on
$B^l(Z)$ by extending $T_n$ to $B^l(Z)$:

\[ \hat{T}_n u(z) := \inf_{a \in A} \int_{S_{n,i_n(z)}} \Bigl[ c(x,a) + \beta \int_Z u(y)\, p(dy|x,a) \Bigr] \nu_{n,i_n(z)}(dx), \tag{2.4} \]

where $i_n : Z \to \{1, \ldots, k_n\}$ maps $z$ to the index of the partition $\{S_{n,i}\}$ it belongs to. To see that this
operator is well defined, let the stochastic kernel $r_n(dx|z)$ on $Z$ given $Z$ be defined as

\[ r_n(dx|z) := \sum_{i=1}^{k_n} \nu_{n,i}(dx)\, 1_{S_{n,i}}(z), \]

where $1_B$ denotes the indicator function of the set $B$. Then, we can write the right hand side of
(2.4) as

\[ \inf_{a \in A} \int_Z \Bigl[ c(x,a) + \beta \int_Z u(y)\, p(dy|x,a) \Bigr] r_n(dx|z). \]

Therefore, by Propositions 1.1 and 1.2, we can conclude that $\hat{T}_n$ maps $B^l(Z)$ into $B^l(Z)$. Furthermore,
it is a contraction operator with modulus $\beta$ which can be shown using Hernandez-Lerma
[20, Proposition A.2, p. 122]. Hence, it has a unique fixed point $\hat{J}_n^*$ that belongs to $B^l(Z)$, and this
fixed point must be constant over the sets $S_{n,i}$ because of the averaging operation on each $S_{n,i}$.
Furthermore, since $\hat{T}_n(u \circ Q_n) = (T_n u) \circ Q_n$ for all $u \in B(Z_n)$, we have

\[ \hat{T}_n(J_n^* \circ Q_n) = (T_n J_n^*) \circ Q_n = J_n^* \circ Q_n. \]

Hence, the fixed point of $\hat{T}_n$ is the piecewise constant extension of the fixed point of $T_n$, i.e.,

\[ \hat{J}_n^* = J_n^* \circ Q_n. \]

Remark 2.1. In the rest of this paper, when we take the integral of any function with respect
to $\nu_{n,i_n(z)}$, it is tacitly assumed that the integral is taken over the whole set $S_{n,i_n(z)}$. Hence, we can drop
$S_{n,i_n(z)}$ in the integral for ease of notation.

We now define another operator $F_n$ on $B^l(Z)$ by simply interchanging the order of the infimum
and the integral in (2.4), i.e.,

\[ F_n u(z) := \int \inf_{a \in A} \Bigl[ c(x,a) + \beta \int_Z u(y)\, p(dy|x,a) \Bigr] \nu_{n,i_n(z)}(dx) = \Gamma_n T u(z), \]

where

\[ \Gamma_n u(z) := \int u(x)\, \nu_{n,i_n(z)}(dx). \]
We note that $F_n$ is the extension (to infinite state spaces) of the operator defined in Van Roy
[37, p. 236] for the proposed approximate value iteration algorithm. However, unlike in Van Roy
[37], $F_n$ will serve here as an intermediate point between $T$ and $T_n$ (or $\hat{T}_n$) to solve (Q1) for the
discounted cost. To this end, we first note that $F_n$ is a contraction operator on $B^l(Z)$ with modulus
$\beta$. Indeed it is clear that $F_n$ maps $B^l(Z)$ into itself by Propositions 1.1 and 1.2. Furthermore, for
any $u, v \in B^l(Z)$, we clearly have $\|\Gamma_n u - \Gamma_n v\| \le \|u - v\|$. Hence, since $T$ is a contraction operator
on $B^l(Z)$ with modulus $\beta$, $F_n$ is also a contraction operator on $B^l(Z)$ with modulus $\beta$.

Remark 2.2. Since we only assume that the stochastic kernel $p$ is weakly continuous, it is not
true that $\hat{T}_n$ and $F_n$ map $B(Z)$ into itself (see Hernandez-Lerma and Lasserre [21, Proposition D.5,
p. 182]). This is the point where we need to enlarge the set of functions on which these operators
act.

The following theorem states that the fixed point, say $u_n^*$, of $F_n$ converges to the fixed point $J^*$
(i.e., the value function) of $T$ as $n$ goes to infinity. Note that although $T$ is originally defined on
$C_b(Z)$, it can be proved that $T$, when acting on $B^l(Z)$, maps $B^l(Z)$ into itself.

Theorem 2.3. If $u_n^*$ is the unique fixed point of $F_n$, then $\lim_{n \to \infty} \|u_n^* - J^*\| = 0$.

The proof of Theorem 2.3 requires two lemmas.

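Lemma 2.1. For any $u \in B^l(Z)$, we have

\[ \|u - \Gamma_n u\| \le 2 \inf_{r \in \mathbb{R}^{k_n}} \|u - \Phi_r\|, \]

where $\Phi_r(z) = \sum_{i=1}^{k_n} r_i 1_{S_{n,i}}(z)$, $r = (r_1, \ldots, r_{k_n})$.

Proof. Fix any $r \in \mathbb{R}^{k_n}$. Then, using the identity $\Gamma_n \Phi_r = \Phi_r$, we obtain

\[ \begin{aligned}
\|u - \Gamma_n u\| &\le \|u - \Phi_r\| + \|\Phi_r - \Gamma_n u\| \\
&= \|u - \Phi_r\| + \|\Gamma_n \Phi_r - \Gamma_n u\| \\
&\le \|u - \Phi_r\| + \|\Phi_r - u\|.
\end{aligned} \]

Since $r$ is arbitrary, this completes the proof. □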
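
The bound of Lemma 2.1 can be sanity-checked numerically. The sketch below is an illustration only: it assumes a finite stand-in for $Z$ (200 sample points of $[0,1)$), eight cells of equal size with uniform weights playing the role of the measures $\nu_{n,i}$, and an arbitrary test function. On each cell the best constant approximation in the sup norm is the midrange, which gives the infimum over piecewise constant functions in closed form. The helper names are hypothetical.

```python
import math


def gamma(u, cells):
    """Averaging operator: replace u on each cell by its (uniformly weighted) cell average."""
    out = [0.0] * len(u)
    for cell in cells:
        avg = sum(u[j] for j in cell) / len(cell)
        for j in cell:
            out[j] = avg
    return out


def best_piecewise_constant_error(u, cells):
    """inf over r of ||u - Phi_r||: half the largest oscillation of u over a cell."""
    return max((max(u[j] for j in cell) - min(u[j] for j in cell)) / 2
               for cell in cells)


def sup_norm_gap(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


# Finite stand-in for Z and an arbitrary test function (both assumed).
points = [j / 200 for j in range(200)]
u = [math.sin(6 * z) + z * z for z in points]
cells = [range(i * 25, (i + 1) * 25) for i in range(8)]

lhs = sup_norm_gap(u, gamma(u, cells))
rhs = 2 * best_piecewise_constant_error(u, cells)
print(lhs, rhs)
```

The check also makes the mechanism of the proof visible: the averaging operator leaves piecewise constant functions unchanged and never increases sup-norm distances, so the averaging error is at most twice the best piecewise constant approximation error.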

Notice that because of the operator $\Gamma_n$, the fixed point $u_n^*$ of $F_n$ must be constant over the sets
$S_{n,i}$. We use this property to prove the next lemma.

Lemma 2.2. We have

\[ \|u_n^* - J^*\| \le \frac{2}{1 - \beta} \inf_{r \in \mathbb{R}^{k_n}} \|J^* - \Phi_r\|. \]

Proof. Note that $\Gamma_n u_n^* = u_n^*$ since $u_n^*$ is constant over the sets $S_{n,i}$. Then, we have

\[ \begin{aligned}
\|u_n^* - J^*\| &\le \|u_n^* - \Gamma_n J^*\| + \|\Gamma_n J^* - J^*\| \\
&= \|F_n u_n^* - \Gamma_n T J^*\| + \|\Gamma_n J^* - J^*\| \\
&= \|\Gamma_n T u_n^* - \Gamma_n T J^*\| + \|\Gamma_n J^* - J^*\| \quad \text{(by the definition of } F_n) \\
&\le \|T u_n^* - T J^*\| + \|\Gamma_n J^* - J^*\| \quad \text{(since } \|\Gamma_n u - \Gamma_n v\| \le \|u - v\|) \\
&\le \beta \|u_n^* - J^*\| + \|\Gamma_n J^* - J^*\|.
\end{aligned} \]

Hence, we obtain $\|u_n^* - J^*\| \le \frac{1}{1-\beta} \|\Gamma_n J^* - J^*\|$. The result now follows from Lemma 2.1. □

Proof of Theorem 2.3. Recall that since $Z$ is compact, the function $J^*$ is uniformly continuous
and $\mathrm{diam}(S_{n,i}) < 2/n$ for all $i = 1, \ldots, k_n$. Hence, $\lim_{n \to \infty} \inf_{r \in \mathbb{R}^{k_n}} \|J^* - \Phi_r\| = 0$, which completes
the proof in view of Lemma 2.2. □

The next step is to show that the fixed point $\hat{J}_n^*$ of $\hat{T}_n$ converges to the fixed point $J^*$ of $T$. To
this end, we first prove the following result.





Saldi, Yüksel, and Linder: Asymptotic Optimality of Finite Approximations to MDPs
Mathematics of Operations Research 00(0), pp. 000—000, ©0000 INFORMS 


Lemma 2.3. For any $u \in C_b(Z)$, $\|\hat{T}_n u - F_n u\| \to 0$ as $n \to \infty$.

Proof. Note that since $\int_Z u(x)\, p(dx|y,a)$ is continuous as a function of $(y,a)$ by Assumption 2.1-(b),
it is sufficient to prove that for any $l \in C_b(Z \times A)$

\[ \Bigl\| \min_{a} \int l(y,a)\, \nu_{n,i_n(z)}(dy) - \int \min_{a} l(y,a)\, \nu_{n,i_n(z)}(dy) \Bigr\|
:= \sup_{z \in Z} \Bigl| \min_{a} \int l(y,a)\, \nu_{n,i_n(z)}(dy) - \int \min_{a} l(y,a)\, \nu_{n,i_n(z)}(dy) \Bigr| \to 0 \]

as $n \to \infty$. Fix any $\varepsilon > 0$. Define $\{z_i\}_{i=1}^{\infty} := \bigcup_n Z_n$ and let $\{a_i\}_{i=1}^{\infty}$ be a sequence in $A$ such that
$\min_{a \in A} l(z_i, a) = l(z_i, a_i)$; such $a_i$ exists for each $z_i$ because $l(z_i, \,\cdot\,)$ is continuous and $A$ is compact.
Define $g(y) := \min_{a \in A} l(y, a)$, which can be proved to be continuous, and therefore uniformly
continuous since $Z$ is compact. Thus by the uniform continuity of $l$, there exists $\delta > 0$ such that
$d_{Z \times A}((y,a), (y',a')) < \delta$ implies $|g(y) - g(y')| < \varepsilon/2$ and $|l(y,a) - l(y',a')| < \varepsilon/2$. Choose $n_0$ such that
$2/n_0 < \delta$. Then for all $n \ge n_0$, $\max_{i \in \{1,\ldots,k_n\}} \mathrm{diam}(S_{n,i}) < 2/n < \delta$. Hence, for all $y \in S_{n,i}$ we have
$|l(y,a_i) - \min_{a \in A} l(y,a)| \le |l(y,a_i) - l(z_i,a_i)| + |\min_{a \in A} l(z_i,a) - \min_{a \in A} l(y,a)| = |l(y,a_i) - l(z_i,a_i)| + |g(z_i) - g(y)| < \varepsilon$.
This implies

\[ \begin{aligned}
\Bigl\| \min_{a} \int l(y,a)\, \nu_{n,i_n(z)}(dy) - \int \min_{a} l(y,a)\, \nu_{n,i_n(z)}(dy) \Bigr\|
&\le \sup_{z \in Z} \Bigl| \int l(y,a_i)\, \nu_{n,i_n(z)}(dy) - \int \min_{a} l(y,a)\, \nu_{n,i_n(z)}(dy) \Bigr| \\
&\le \sup_{z \in Z} \int \sup_{y \in S_{n,i_n(z)}} \bigl| l(y,a_i) - \min_{a} l(y,a) \bigr|\, \nu_{n,i_n(z)}(dy) < \varepsilon,
\end{aligned} \]

where $i = i_n(z)$ on the right hand side. This completes the proof. □

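Theorem 2.4. The fixed point $\hat{J}_n^*$ of $\hat{T}_n$ converges to the fixed point $J^*$ of $T$.

Proof. We have

\[ \begin{aligned}
\|\hat{J}_n^* - J^*\| &\le \|\hat{T}_n \hat{J}_n^* - \hat{T}_n J^*\| + \|\hat{T}_n J^* - F_n J^*\| + \|F_n J^* - F_n u_n^*\| + \|F_n u_n^* - J^*\| \\
&\le \beta \|\hat{J}_n^* - J^*\| + \|\hat{T}_n J^* - F_n J^*\| + \beta \|J^* - u_n^*\| + \|u_n^* - J^*\|.
\end{aligned} \]

Hence

\[ \|\hat{J}_n^* - J^*\| \le \frac{\|\hat{T}_n J^* - F_n J^*\| + (1 + \beta)\|J^* - u_n^*\|}{1 - \beta}. \]

The theorem now follows from Theorem 2.3 and Lemma 2.3. □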

Recall the optimal stationary policy $f_n^*$ for MDP$_n$ and its extension $f_n(z) = f_n^* \circ Q_n(z)$ to $Z$.
Since $\hat{J}_n^* = J_n^* \circ Q_n$, it is straightforward to prove that $f_n$ is the optimal selector of $\hat{T}_n \hat{J}_n^*$; that is,

\[ \hat{T}_n \hat{J}_n^* = \hat{J}_n^* = \hat{T}_{f_n} \hat{J}_n^*, \]

where $\hat{T}_{f_n}$ is defined as

\[ \hat{T}_{f_n} u(z) := \int \Bigl[ c(x, f_n(x)) + \beta \int_Z u(y)\, p(dy|x, f_n(x)) \Bigr] \nu_{n,i_n(z)}(dx). \]

Define analogously

\[ T_{f_n} u(z) := c(z, f_n(z)) + \beta \int_Z u(y)\, p(dy|z, f_n(z)). \]
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It can be proved that both Tj n and Tj n are contraction operators on B l (T) with modulus f3, and 
it is known that the fixed point of T ^ is the true cost function of the stationary policy f n (i.e., 
J(fn,z)). 

Lemma 2.4. || Tf n u — T f - n u\\ —> 0 as n —> oo, for any u £ C b ( Z). 

Proof. The statement follows from the uniform continuity of the function c(z,a) + 
/3 f z u(y)p(dy\z, a) and the fact that f„ is constant over the sets S n j. □ 

Now, we prove the main result of this section. 

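Proof of Theorem 2.2. We have

$$\begin{aligned}
\|J(\hat f_n,\cdot) - J^*\| &\le \|T_{\hat f_n} J(\hat f_n,\cdot) - T_{\hat f_n} J^*\| + \|T_{\hat f_n} J^* - \hat T_{\hat f_n} J^*\| + \|\hat T_{\hat f_n} J^* - \hat T_{\hat f_n} \hat J_n^*\| + \|\hat J_n^* - J^*\| \\
&\le \beta\|J(\hat f_n,\cdot) - J^*\| + \|T_{\hat f_n} J^* - \hat T_{\hat f_n} J^*\| + \beta\|J^* - \hat J_n^*\| + \|\hat J_n^* - J^*\|.
\end{aligned}$$

Hence, we obtain

$$\|J(\hat f_n,\cdot) - J^*\| \le \frac{\|T_{\hat f_n} J^* - \hat T_{\hat f_n} J^*\| + (1+\beta)\|J^* - \hat J_n^*\|}{1-\beta}.$$

The result follows from Lemma 2.4 and Theorem 2.4. □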
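The construction behind Theorem 2.2 can be tried out numerically. The following Python sketch is not part of the paper: it assumes a toy model in which a fine grid of 120 points stands in for the compact state space, with three actions, a hypothetical Gaussian-shaped kernel, a quadratic cost, and the measure ν taken uniform on each bin. All function names (`value_iteration`, `policy_cost`, `quantize`, `toy_model`, `approximation_errors`) are illustrative. It builds the finite model MDP_n, solves it by value iteration, extends the resulting policy piecewise constantly, and reports the sup-norm gap between the true cost of that policy and the optimal value function.

```python
import numpy as np

def value_iteration(P, c, beta, tol=1e-10, max_iter=10_000):
    """Return (J, greedy policy) for transition array P[a, s, s'] and cost c[s, a]."""
    J = np.zeros(c.shape[0])
    for _ in range(max_iter):
        Q = c + beta * np.einsum("asy,y->sa", P, J)   # one-step lookahead
        J_new = Q.min(axis=1)
        done = np.max(np.abs(J_new - J)) < tol
        J = J_new
        if done:
            break
    Q = c + beta * np.einsum("asy,y->sa", P, J)
    return J, Q.argmin(axis=1)

def policy_cost(P, c, beta, f):
    """Exact discounted cost of stationary policy f: solve (I - beta P_f) J = c_f."""
    idx = np.arange(c.shape[0])
    return np.linalg.solve(np.eye(len(idx)) - beta * P[f, idx, :], c[idx, f])

def quantize(P, c, k):
    """Build MDP_n with k bins (k <= number of states); nu uniform on each bin (assumption)."""
    S = c.shape[0]
    bins = (np.arange(S) * k) // S            # Q_n: state index -> bin index
    E = np.zeros((S, k))
    E[np.arange(S), bins] = 1.0               # pushforward under Q_n
    W = (E / E.sum(axis=0)).T                 # row i is nu_{n,i}
    c_n = W @ c                               # c_n(z_i, a) = int c(z, a) nu_{n,i}(dz)
    P_n = np.einsum("is,asy,yj->aij", W, P, E)
    return bins, P_n, c_n

def toy_model(S=120, beta=0.9):
    """Hypothetical model on a grid of [0, 1]; not taken from the paper."""
    z = np.linspace(0.0, 1.0, S)
    actions = np.array([0.0, 0.5, 1.0])
    c = (z[:, None] - 0.5) ** 2 + 0.1 * actions[None, :]
    P = np.empty((len(actions), S, S))
    for a, u in enumerate(actions):
        mean = np.clip(z + 0.3 * (u - 0.5), 0.0, 1.0)
        kern = np.exp(-((z[None, :] - mean[:, None]) ** 2) / (2 * 0.1 ** 2))
        P[a] = kern / kern.sum(axis=1, keepdims=True)
    return P, c, beta

def approximation_errors(ks=(2, 8, 30, 120)):
    """sup |J(f_hat_n, .) - J*| for each number of bins in ks."""
    P, c, beta = toy_model()
    J_star, _ = value_iteration(P, c, beta)
    errors = []
    for k in ks:
        bins, P_n, c_n = quantize(P, c, k)
        _, f_n = value_iteration(P_n, c_n, beta)
        f_hat = f_n[bins]                     # piecewise constant extension of f_n
        errors.append(np.max(np.abs(policy_cost(P, c, beta, f_hat) - J_star)))
    return errors

if __name__ == "__main__":
    for k, err in zip((2, 8, 30, 120), approximation_errors()):
        print(f"bins={k:4d}  sup|J(f_hat) - J*| = {err:.2e}")
```

When the number of bins equals the number of grid points, the finite model coincides with the toy model, so the gap should vanish up to the value-iteration tolerance. For coarser quantizations the sketch only reports the gap and makes no claim about its rate of decay.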


2.2. Average Cost. In this section we impose some new conditions on the components of the original MDP in addition to Assumption 2.1 to solve (Q1) for the average cost. A version of the first two conditions was imposed in Vega-Amaya [38], Jaskiewicz and Nowak [26] to show the existence of the solution to the Average Cost Optimality Equation (ACOE) and the optimal stationary policy.

Assumption 2.2. Suppose Assumption 2.1 holds with item (b) replaced by condition (f) below. In addition, there exist a non-trivial finite measure $\zeta$ on $Z$, a nonnegative measurable function $\theta$ on $Z \times A$, and a constant $\lambda \in (0,1)$ such that for all $(z,a) \in Z \times A$

(d) $p(B|z,a) \ge \zeta(B)\theta(z,a)$ for all $B \in \mathcal{B}(Z)$,

(e) $\frac{1-\lambda}{\zeta(Z)} \le \theta(z,a)$,

(f) The stochastic kernel $p(\cdot|z,a)$ is continuous in $(z,a)$ with respect to the total variation distance.

Throughout this section, it is assumed that Assumption 2.2 holds. Observe that any deterministic stationary policy $f$ defines a stochastic kernel $p(\cdot|z,f(z))$ on $Z$ given $Z$ which is the transition probability of the Markov chain $\{z_t\}_{t=1}^{\infty}$ (state process) induced by $f$. For any $t \ge 1$, let us write $p^t(\cdot|z,f(z))$ to denote the $t$-step transition probability of this Markov chain given the initial point $z$; that is, $p^t(\cdot|z,f(z))$ is recursively defined as

$$p^{t+1}(\cdot|z,f(z)) = \int_Z p(\cdot|x,f(x))\,p^t(dx|z,f(z)).$$

To study average cost optimal control problems, it is in general assumed that there exists an invariant distribution under any stationary control policy, so that the average cost of any stationary policy can be written as an integral of the one-stage cost function with respect to this invariant distribution. With this representation, one can then deduce the optimality of stationary policies using the linear programming or the convex analytic methods (see Hernandez-Lerma and Lasserre [21], Borkar [8]). However, to solve the approximation problem for the average cost, we need, in addition to the existence of an invariant distribution, the convergence of $t$-step transition probabilities to the invariant distribution, at some rate, for both the original and the reduced problems. Therefore, it is crucial to impose proper conditions on the original model so that, on the one hand, they guarantee the convergence of $t$-step transition probabilities to the invariant distribution for all stationary policies for the original system and, on the other hand, one is able to show that similar conditions are satisfied by the reduced problems. Conditions (d) and (e) in Assumption 2.2 are examples of such conditions which were also used in the literature extensively. Indeed, if we define the weight function $w \equiv 1$, then condition (e) corresponds to the so-called 'drift inequality': for all $(z,a) \in Z \times A$

$$\int_Z w(y)\,p(dy|z,a) \le \lambda w(z) + \zeta(w)\theta(z,a),$$

and condition (d) corresponds to the so-called 'minorization' condition, both of which were used in literature for studying geometric ergodicity of Markov chains (see Hernandez-Lerma and Lasserre [22], Meyn and Tweedie [29], and references therein).
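A small numerical illustration of how a minorization condition forces geometric convergence to the invariant distribution is given below. It is a sketch under assumptions that are not in the paper: a finite state space with a single fixed policy, a probability vector `zeta`, and a constant minorization weight `eps` (so that conditions (d) and (e) hold with θ ≡ `eps` and λ = 1 − `eps`). In this toy setting the classical Doeblin argument yields the bound 2(1 − `eps`)^t when the total variation norm is taken as the L1 norm. The constants R and κ of the general theory need not coincide with these values.

```python
import numpy as np

def minorized_chain(n_states=6, eps=0.3, seed=0):
    """Random kernel P = eps * 1 zeta^T + (1 - eps) * Q, so P(x, B) >= eps * zeta(B)."""
    rng = np.random.default_rng(seed)
    zeta = rng.random(n_states)
    zeta /= zeta.sum()
    Q = rng.random((n_states, n_states))
    Q /= Q.sum(axis=1, keepdims=True)
    return eps * np.outer(np.ones(n_states), zeta) + (1.0 - eps) * Q

def invariant_distribution(P):
    """Solve mu P = mu together with sum(mu) = 1 in the least-squares sense."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def tv_to_invariant(P, t_max=20):
    """sup_x ||P^t(x, .) - mu||_TV for t = 1..t_max, with ||.||_TV the L1 norm."""
    mu = invariant_distribution(P)
    Pt = np.eye(P.shape[0])
    out = []
    for _ in range(t_max):
        Pt = Pt @ P
        out.append(np.abs(Pt - mu[None, :]).sum(axis=1).max())
    return np.array(out)

if __name__ == "__main__":
    eps = 0.3
    dist = tv_to_invariant(minorized_chain(eps=eps))
    for t, d in enumerate(dist, start=1):
        print(f"t={t:2d}  distance={d:.3e}  Doeblin bound={2 * (1 - eps) ** t:.3e}")
```

The printed distances should sit below the Doeblin bound at every step, which is the finite-state analogue of the geometric ergodicity that the minorization and drift conditions are designed to guarantee.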

The following theorem is a consequence of Vega-Amaya [38, Theorem 3.3], Gordienko and 
Hernandez-Lerma [18, Lemma 3.4], and Jaskiewicz and Nowak [26, Theorem 3], which also holds 
with Assumption 2.2-(f) replaced by Assumption 2.1-(b). 

Theorem 2.5. For any $f \in F$, the stochastic kernel $p(\cdot|z,f(z))$ is positive Harris recurrent with unique invariant probability measure $\mu_f$. Therefore, we have

$$V(f,z) = \int_Z c(z,f(z))\,\mu_f(dz) =: \rho_f.$$

The Markov chain $\{z_t\}_{t=1}^{\infty}$ induced by $f$ is geometrically ergodic; that is, there exist positive real numbers $R$ and $\kappa < 1$ such that for every $z \in Z$

$$\sup_{f\in F}\|p^t(\cdot|z,f(z)) - \mu_f\|_{TV} \le R\kappa^t,$$

where $R$ and $\kappa$ continuously depend on $\zeta(Z)$ and $\lambda$. Finally, there exist $f^* \in F$ and $h^* \in B(Z)$ such that the triplet $(h^*, f^*, \rho_{f^*})$ satisfies the average cost optimality equality (ACOE), i.e.,

$$\begin{aligned}
\rho_{f^*} + h^*(z) &= \min_{a\in A}\Bigl[c(z,a) + \int_Z h^*(y)\,p(dy|z,a)\Bigr] \\
&= c(z,f^*(z)) + \int_Z h^*(y)\,p(dy|z,f^*(z)),
\end{aligned}$$

and therefore,

$$\inf_{\pi\in\Pi} V(\pi,z) =: V^*(z) = \rho_{f^*}.$$

For each $n$, define the one-stage cost function $b_n : Z \times A \to [0,\infty)$ and the stochastic kernel $q_n$ on $Z$ given $Z \times A$ as

$$b_n(z,a) := \int c(x,a)\,\nu_{n,i_n(z)}(dx),$$
$$q_n(\cdot|z,a) := \int p(\cdot|x,a)\,\nu_{n,i_n(z)}(dx).$$

Observe that $c_n$ (i.e., the one-stage cost function of MDP$_n$) is the restriction of $b_n$ to $Z_n$, and $p_n$ (i.e., the stochastic kernel of MDP$_n$) is the pushforward of the measure $q_n$ with respect to $Q_n$; that is, $c_n(z_{n,i},a) = b_n(z_{n,i},a)$ for all $i = 1,\ldots,k_n$ and $p_n(\cdot|z_{n,i},a) = Q_n * q_n(\cdot|z_{n,i},a)$.

For each $n$, let $\overline{\mathrm{MDP}}_n$ be defined as a Markov decision process with the following components: $Z$ is the state space, $A$ is the action space, $q_n$ is the transition probability, and $c$ is the one-stage cost function. Similarly, let $\widetilde{\mathrm{MDP}}_n$ be defined as a Markov decision process with the following components: $Z$ is the state space, $A$ is the action space, $q_n$ is the transition probability, and $b_n$ is the one-stage cost function. History spaces, policies and cost functions are defined in a similar way as before. The models $\overline{\mathrm{MDP}}_n$ and $\widetilde{\mathrm{MDP}}_n$ are used as transitions between the original MDP and MDP$_n$ in a similar way as the operators $F_n$ and $\hat T_n$ were used as transitions between $T$ and $T_n$ for the discounted cost. We note that a similar technique was used in the proof of Ortner [30, Theorem 2], which studied the approximation problem for finite state-action MDPs. In Ortner [30] the one-stage cost function is first perturbed and then the transition probability is perturbed. We first perturb the transition probability and then the cost function. However, our proof method is otherwise quite different from that of Ortner [30, Theorem 2] since Ortner [30] assumes finite state and action spaces.

We note that a careful analysis of $\widetilde{\mathrm{MDP}}_n$ reveals that its Bellman optimality operator is essentially the operator $\hat T_n$. Hence, the value function of $\widetilde{\mathrm{MDP}}_n$ is the piecewise constant extension of the value function of MDP$_n$ for the discounted cost. A similar conclusion will be made for the average cost in Lemma 2.5.

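First, notice that if we define

$$\theta_n(z,a) := \int \theta(y,a)\,\nu_{n,i_n(z)}(dy),$$
$$\zeta_n := Q_n * \zeta \quad \text{(i.e., pushforward of } \zeta \text{ with respect to } Q_n\text{)},$$

then it is straightforward to prove that for all $n$, both $\overline{\mathrm{MDP}}_n$ and $\widetilde{\mathrm{MDP}}_n$ satisfy Assumption 2.2-(d),(e) when $\theta$ is replaced by $\theta_n$, and Assumption 2.2-(d),(e) is true for MDP$_n$ when $\theta$ and $\zeta$ are replaced by the restriction of $\theta_n$ to $Z_n$ and $\zeta_n$, respectively.

Hence, Theorem 2.5 holds (with the same $R$ and $\kappa$) for $\overline{\mathrm{MDP}}_n$, $\widetilde{\mathrm{MDP}}_n$, and MDP$_n$ for all $n$. Therefore, we denote by $\bar f_n^*$, $\tilde f_n^*$ and $f_n^*$ the optimal stationary policies of $\overline{\mathrm{MDP}}_n$, $\widetilde{\mathrm{MDP}}_n$, and MDP$_n$ with the corresponding average costs $\bar\rho^{\,n}_{\bar f_n^*}$, $\tilde\rho^{\,n}_{\tilde f_n^*}$ and $\rho^{n}_{f_n^*}$, respectively.

Furthermore, we also write $\bar\rho^{\,n}_f$, $\tilde\rho^{\,n}_f$, and $\rho^n_f$ to denote the average cost of any stationary policy $f$ for $\overline{\mathrm{MDP}}_n$, $\widetilde{\mathrm{MDP}}_n$, and MDP$_n$, respectively. The corresponding invariant probability measures are also denoted in a similar manner, with $\mu$ replacing $\rho$.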

The following lemma essentially says that MDP$_n$ and $\widetilde{\mathrm{MDP}}_n$ are not very different.

Lemma 2.5. The stationary policy given by the piecewise constant extension of the optimal policy $f_n^*$ of MDP$_n$ to $Z$ (i.e., $f_n^* \circ Q_n$) is optimal for $\widetilde{\mathrm{MDP}}_n$ with the same cost function $\rho^n_{f_n^*}$. Hence, $\tilde f_n^* = f_n^* \circ Q_n$ and $\tilde\rho^{\,n}_{\tilde f_n^*} = \rho^n_{f_n^*}$.

Proof. Note that by Theorem 2.5 there exists $h_n^* \in B(Z_n)$ such that the triplet $(h_n^*, f_n^*, \rho^n_{f_n^*})$ satisfies the ACOE for MDP$_n$. But it is straightforward to show that the triplet $(h_n^* \circ Q_n, f_n^* \circ Q_n, \rho^n_{f_n^*})$ satisfies the ACOE for $\widetilde{\mathrm{MDP}}_n$. By Gordienko and Hernandez-Lerma [18, Lemma 5.2], this implies that $f_n^* \circ Q_n$ is an optimal stationary policy for $\widetilde{\mathrm{MDP}}_n$ with cost function $\rho^n_{f_n^*}$. Hence $\tilde f_n^* = f_n^* \circ Q_n$ and $\tilde\rho^{\,n}_{\tilde f_n^*} = \rho^n_{f_n^*}$. □

The following theorem is the main result of this section. It states that if one applies the piecewise constant extension of the optimal stationary policy of MDP$_n$ to the original MDP, the resulting cost function will converge to the value function of the original MDP.

Theorem 2.6. The average cost of the optimal policy $\tilde f_n^*$ for $\widetilde{\mathrm{MDP}}_n$, obtained by extending the optimal policy $f_n^*$ of MDP$_n$ to $Z$, converges to the optimal value function $J^* = \rho_{f^*}$ of the original MDP, i.e.,

$$\lim_{n\to\infty} |\rho_{\tilde f_n^*} - \rho_{f^*}| = 0.$$

Hence, to find a near optimal policy for the original MDP, it is sufficient to compute the optimal policy of MDP$_n$ for sufficiently large $n$, and then extend this policy to the original state space.
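The recipe in the last sentence (solve the finite model, then extend the policy) can be sketched for the average cost as follows. This Python sketch is not from the paper and rests on labeled assumptions: a fine grid of 100 points plays the role of the original state space, the kernel is a hypothetical Gaussian-shaped kernel mixed with a uniform component (which supplies a minorization), ν is uniform on each bin, and the finite model is solved by relative value iteration, one standard algorithm among several that could be used. All function names are illustrative.

```python
import numpy as np

def relative_value_iteration(P, c, tol=1e-10, max_iter=100_000):
    """Return (gain estimate, greedy policy) for P[a, s, s'] and c[s, a]."""
    h = np.zeros(c.shape[0])
    for _ in range(max_iter):
        Th = (c + np.einsum("asy,y->sa", P, h)).min(axis=1)
        h_new = Th - Th[0]                    # normalise at reference state 0
        done = np.max(np.abs(h_new - h)) < tol
        h = h_new
        if done:
            break
    Q = c + np.einsum("asy,y->sa", P, h)
    return (Q.min(axis=1) - h)[0], Q.argmin(axis=1)

def average_cost(P, c, f):
    """Average cost of stationary policy f as the integral of c w.r.t. mu_f."""
    idx = np.arange(c.shape[0])
    P_f, c_f = P[f, idx, :], c[idx, f]
    n = len(idx)
    A = np.vstack([P_f.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)   # invariant distribution mu_f
    return float(mu @ c_f)

def quantize(P, c, k):
    """Finite model with k bins (k <= number of states); nu uniform on each bin."""
    S = c.shape[0]
    bins = (np.arange(S) * k) // S
    E = np.zeros((S, k))
    E[np.arange(S), bins] = 1.0
    W = (E / E.sum(axis=0)).T
    return bins, np.einsum("is,asy,yj->aij", W, P, E), W @ c

def toy_model(S=100, mix=0.2):
    """Hypothetical model on a grid of [0, 1]; not taken from the paper."""
    z = np.linspace(0.0, 1.0, S)
    actions = np.array([0.0, 0.5, 1.0])
    c = (z[:, None] - 0.5) ** 2 + 0.1 * actions[None, :]
    P = np.empty((len(actions), S, S))
    for a, u in enumerate(actions):
        mean = np.clip(z + 0.3 * (u - 0.5), 0.0, 1.0)
        kern = np.exp(-((z[None, :] - mean[:, None]) ** 2) / (2 * 0.1 ** 2))
        kern /= kern.sum(axis=1, keepdims=True)
        P[a] = mix / S + (1.0 - mix) * kern   # uniform part gives a minorization
    return P, c

def average_cost_gaps(ks=(2, 10, 100)):
    """Average cost of the extended policy minus the optimal average cost."""
    P, c = toy_model()
    rho_star, _ = relative_value_iteration(P, c)
    gaps = []
    for k in ks:
        bins, P_n, c_n = quantize(P, c, k)
        _, f_n = relative_value_iteration(P_n, c_n)
        gaps.append(average_cost(P, c, f_n[bins]) - rho_star)
    return gaps

if __name__ == "__main__":
    for k, gap in zip((2, 10, 100), average_cost_gaps()):
        print(f"bins={k:4d}  rho(extended policy) - rho* = {gap:.2e}")
```

With as many bins as grid points the finite model equals the toy model, so the gap should be zero up to the iteration tolerance. The gap is nonnegative up to that tolerance for every quantization, since no stationary policy can beat the optimal average cost.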

To show the statement of Theorem 2.6 we will prove a series of auxiliary results. 

Lemma 2.6. For all $t \ge 1$ we have

$$\lim_{n\to\infty} \sup_{(y,f)\in Z\times F} \|p^t(\cdot|y,f(y)) - q_n^t(\cdot|y,f(y))\|_{TV} = 0.$$

Proof. We will prove the lemma by induction. Note that if one views the stochastic kernel $p(\cdot|z,a)$ as a mapping from $Z \times A$ to $\mathcal{P}(Z)$, then Assumption 2.2-(f) implies that this mapping is continuous, and therefore uniformly continuous, when $\mathcal{P}(Z)$ is equipped with the metric induced by the total variation distance.

For $t = 1$ the claim holds by the following argument:

$$\begin{aligned}
\sup_{(y,f)\in Z\times F} \|p(\cdot|y,f(y)) - q_n(\cdot|y,f(y))\|_{TV}
&:= 2 \sup_{(y,f)\in Z\times F}\ \sup_{D\in\mathcal{B}(Z)} |p(D|y,f(y)) - q_n(D|y,f(y))| \\
&\le 2 \sup_{(y,f)\in Z\times F}\ \sup_{D\in\mathcal{B}(Z)} \int |p(D|y,f(y)) - p(D|z,f(y))|\,\nu_{n,i_n(y)}(dz) \\
&\le \sup_{(y,f)\in Z\times F} \int \|p(\cdot|y,f(y)) - p(\cdot|z,f(y))\|_{TV}\,\nu_{n,i_n(y)}(dz) \\
&\le \sup_{y\in Z}\ \sup_{(z,a)\in S_{n,i_n(y)}\times A} \|p(\cdot|y,a) - p(\cdot|z,a)\|_{TV}.
\end{aligned}$$

As the mapping $p(\cdot|z,a) : Z \times A \to \mathcal{P}(Z)$ is uniformly continuous with respect to the total variation distance and $\max_i \operatorname{diam}(S_{n,i}) \to 0$ as $n \to \infty$, the result follows. Assume the claim is true for $t \ge 1$. Then we have

$$\begin{aligned}
&\sup_{(y,f)\in Z\times F} \|p^{t+1}(\cdot|y,f(y)) - q_n^{t+1}(\cdot|y,f(y))\|_{TV} \\
&\quad := \sup_{(y,f)\in Z\times F}\ \sup_{\|g\|\le 1} \Bigl|\int_Z g(x)\,p^{t+1}(dx|y,f(y)) - \int_Z g(x)\,q_n^{t+1}(dx|y,f(y))\Bigr| \\
&\quad \le \sup_{(y,f)\in Z\times F} \Bigl( \sup_{\|g\|\le 1} \Bigl|\int_Z\!\int_Z g(x)\,p(dx|z,f(z))\,p^t(dz|y,f(y)) - \int_Z\!\int_Z g(x)\,p(dx|z,f(z))\,q_n^t(dz|y,f(y))\Bigr| \\
&\qquad\qquad + \sup_{\|g\|\le 1} \Bigl|\int_Z\!\int_Z g(x)\,p(dx|z,f(z))\,q_n^t(dz|y,f(y)) - \int_Z\!\int_Z g(x)\,q_n(dx|z,f(z))\,q_n^t(dz|y,f(y))\Bigr| \Bigr) \\
&\quad \le \sup_{(y,f)\in Z\times F} \|p^t(\cdot|y,f(y)) - q_n^t(\cdot|y,f(y))\|_{TV} + \sup_{(z,f)\in Z\times F} \|p(\cdot|z,f(z)) - q_n(\cdot|z,f(z))\|_{TV}, \qquad (2.5)
\end{aligned}$$

where the last inequality follows from the following property of the total variation distance: for any $h \in B(Z)$ and $\mu, \nu \in \mathcal{P}(Z)$ we have $|\int_Z h(z)\,\mu(dz) - \int_Z h(z)\,\nu(dz)| \le \|h\|\,\|\mu - \nu\|_{TV}$. By the first step of the proof and the induction hypothesis, the last term converges to zero as $n \to \infty$. This completes the proof. □
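Iterating the one-step inequality (2.5) for a fixed policy shows that the t-step kernels differ by at most t times the one-step discrepancy. The Python sketch below checks this consequence on a finite toy example. It is an illustration under assumptions not in the paper: both kernels are random 5 × 5 stochastic matrices, the second being a small perturbation of the first, and the total variation norm is computed as the L1 norm of the row difference.

```python
import numpy as np

def l1_kernel_distance(P, Q):
    """sup_x ||P(x, .) - Q(x, .)||_TV with the convention ||.||_TV = L1 norm."""
    return np.abs(P - Q).sum(axis=1).max()

def t_step_distances(P, Q, t_max):
    """Distances between the t-step kernels P^t and Q^t for t = 1..t_max."""
    Pt, Qt = np.eye(P.shape[0]), np.eye(Q.shape[0])
    out = []
    for _ in range(t_max):
        Pt, Qt = Pt @ P, Qt @ Q
        out.append(l1_kernel_distance(Pt, Qt))
    return np.array(out)

def random_kernel(n, rng):
    """Random row-stochastic matrix (hypothetical stand-in for a kernel)."""
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
P = random_kernel(5, rng)
Q = 0.95 * P + 0.05 * random_kernel(5, rng)   # a small perturbation of P
d = t_step_distances(P, Q, t_max=10)

if __name__ == "__main__":
    for t, dist in enumerate(d, start=1):
        print(f"t={t:2d}  distance={dist:.3e}  t * one-step distance={t * d[0]:.3e}")
```

The bound grows linearly in t, which is why the proof of the average cost result combines it with geometric ergodicity rather than relying on it alone.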

Remark 2.3. This is the point where we need the continuity of the transition probability p 
with respect to the total variation distance. If we assume that the stochastic kernel p is only weakly 
or setwise continuous, then it does not seem possible to prove a result similar to Lemma 2.6 for 
the weak and the setwise topologies. 

Using Lemma 2.6 we prove the following result. 

Lemma 2.7. We have $\sup_{f\in F} |\bar\rho^{\,n}_f - \rho_f| \to 0$ as $n \to \infty$, where $\bar\rho^{\,n}_f$ is the cost function of the policy $f$ for $\overline{\mathrm{MDP}}_n$ and $\rho_f$ is the cost function of the policy $f$ for the original MDP.

Proof. For any $t \ge 1$ and $y \in Z$ we have

$$\begin{aligned}
\sup_{f\in F} |\bar\rho^{\,n}_f - \rho_f|
&= \sup_{f\in F} \Bigl|\int_Z c(z,f(z))\,\bar\mu^n_f(dz) - \int_Z c(z,f(z))\,\mu_f(dz)\Bigr| \\
&\le \sup_{f\in F} \Bigl|\int_Z c(z,f(z))\,\bar\mu^n_f(dz) - \int_Z c(z,f(z))\,q_n^t(dz|y,f(y))\Bigr| \\
&\quad + \sup_{f\in F} \Bigl|\int_Z c(z,f(z))\,q_n^t(dz|y,f(y)) - \int_Z c(z,f(z))\,p^t(dz|y,f(y))\Bigr| \\
&\quad + \sup_{f\in F} \Bigl|\int_Z c(z,f(z))\,p^t(dz|y,f(y)) - \int_Z c(z,f(z))\,\mu_f(dz)\Bigr| \\
&\le 2R\kappa^t\|c\| + \|c\| \sup_{(y,f)\in Z\times F} \|q_n^t(\cdot|y,f(y)) - p^t(\cdot|y,f(y))\|_{TV} \quad \text{(by Theorem 2.5-(ii))},
\end{aligned}$$

where $R$ and $\kappa$ are the constants in Theorem 2.5. Then, the result follows from Lemma 2.6. □

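The following lemma states that the value function of $\overline{\mathrm{MDP}}_n$ converges to the value function of the original MDP.

Lemma 2.8. We have $|\bar\rho^{\,n}_{\bar f_n^*} - \rho_{f^*}| \to 0$ as $n \to \infty$.

Proof. Notice that

$$\begin{aligned}
|\bar\rho^{\,n}_{\bar f_n^*} - \rho_{f^*}|
&= \max\bigl(\bar\rho^{\,n}_{\bar f_n^*} - \rho_{f^*},\ \rho_{f^*} - \bar\rho^{\,n}_{\bar f_n^*}\bigr) \\
&\le \max\bigl(\bar\rho^{\,n}_{f^*} - \rho_{f^*},\ \rho_{\bar f_n^*} - \bar\rho^{\,n}_{\bar f_n^*}\bigr) \\
&\le \sup_{f\in F} |\bar\rho^{\,n}_f - \rho_f|.
\end{aligned}$$

Then, the result follows from Lemma 2.7. □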

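Lemma 2.9. We have $\sup_{f\in F} |\tilde\rho^{\,n}_f - \bar\rho^{\,n}_f| \to 0$ as $n \to \infty$.

Proof. It is straightforward to show that $b_n \to c$ uniformly. Since the probabilistic structure of $\overline{\mathrm{MDP}}_n$ and $\widetilde{\mathrm{MDP}}_n$ are the same (i.e., $\bar\mu^n_f = \tilde\mu^n_f$ for all $f$), we have

$$\begin{aligned}
\sup_{f\in F} |\tilde\rho^{\,n}_f - \bar\rho^{\,n}_f|
&= \sup_{f\in F} \Bigl|\int_Z b_n(z,f(z))\,\tilde\mu^n_f(dz) - \int_Z c(z,f(z))\,\bar\mu^n_f(dz)\Bigr| \\
&\le \sup_{f\in F} \int_Z |b_n(z,f(z)) - c(z,f(z))|\,\bar\mu^n_f(dz) \\
&\le \|b_n - c\|.
\end{aligned}$$

This completes the proof. □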

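The next lemma states that the difference between the value functions of $\overline{\mathrm{MDP}}_n$ and $\widetilde{\mathrm{MDP}}_n$ converges to zero.

Lemma 2.10. We have $|\tilde\rho^{\,n}_{\tilde f_n^*} - \bar\rho^{\,n}_{\bar f_n^*}| \to 0$ as $n \to \infty$.

Proof. See the proof of Lemma 2.8. □

The following result states that if we apply the optimal policy of $\widetilde{\mathrm{MDP}}_n$ to $\overline{\mathrm{MDP}}_n$, then the resulting cost converges to the value function of $\overline{\mathrm{MDP}}_n$.

Lemma 2.11. We have $|\bar\rho^{\,n}_{\tilde f_n^*} - \bar\rho^{\,n}_{\bar f_n^*}| \to 0$ as $n \to \infty$.

Proof. Since $|\bar\rho^{\,n}_{\tilde f_n^*} - \bar\rho^{\,n}_{\bar f_n^*}| \le |\bar\rho^{\,n}_{\tilde f_n^*} - \tilde\rho^{\,n}_{\tilde f_n^*}| + |\tilde\rho^{\,n}_{\tilde f_n^*} - \bar\rho^{\,n}_{\bar f_n^*}|$, the result follows from Lemmas 2.9 and 2.10. □

Now, we are ready to prove the main result of this section.

Proof of Theorem 2.6. We have $|\rho_{\tilde f_n^*} - \rho_{f^*}| \le |\rho_{\tilde f_n^*} - \bar\rho^{\,n}_{\tilde f_n^*}| + |\bar\rho^{\,n}_{\tilde f_n^*} - \bar\rho^{\,n}_{\bar f_n^*}| + |\bar\rho^{\,n}_{\bar f_n^*} - \rho_{f^*}|$. The result now follows from Lemmas 2.7, 2.11 and 2.8. □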
3. Finite State Approximations of MDPs with Non-Compact State Space. In this section we consider (Q1) for noncompact state MDPs with unbounded one-stage cost. To solve (Q1), we use the following strategy: (i) first, we define a sequence of compact-state MDPs to approximate the original MDP, (ii) we use Theorems 2.2 and 2.6 to approximate the compact-state MDPs by finite-state models, and (iii) we prove the convergence of the finite-state models to the original model. In fact, steps (ii) and (iii) will be accomplished simultaneously.

We impose the assumptions below on the components of the Markov decision process; additional assumptions will be imposed for the average cost problem. With the exception of the local compactness of the state space, these are the usual assumptions used in the literature for studying Markov decision processes with unbounded cost.

Assumption 3.1.

(a) The one-stage cost function $c$ is continuous.

(b) The stochastic kernel $p(\cdot|x,a)$ is weakly continuous in $(x,a)$.

(c) $X$ is locally compact and $A$ is compact.

(d) There exist nonnegative real numbers $M$ and $\alpha \in [1, \frac{1}{\beta})$, and a continuous weight function $w : X \to [1,\infty)$ such that for each $x \in X$, we have

$$\sup_{a\in A} |c(x,a)| \le M w(x), \qquad (3.1)$$

$$\sup_{a\in A} \int_X w(y)\,p(dy|x,a) \le \alpha w(x), \qquad (3.2)$$

and $\int_X w(y)\,p(dy|x,a)$ is continuous in $(x,a)$.

Since $X$ is a locally compact separable metric space, there exists a nested sequence of compact sets $\{K_n\}$ such that $K_n \subset \operatorname{int} K_{n+1}$ and $X = \bigcup_{n=1}^{\infty} K_n$ (Aliprantis and Border [1, Lemma 2.76, p. 58]).

Lemma 3.1. For any compact subset $K$ of $X$ and for any $\varepsilon > 0$, there exists a compact subset $K_\varepsilon$ of $X$ such that

$$\sup_{(x,a)\in K\times A} \int_{K_\varepsilon^c} w(y)\,p(dy|x,a) < \varepsilon,$$

where $D^c$ denotes the complement of the set $D$.

Proof. We prove the lemma by contradiction. Assume the claim is wrong. Since every compact subset $K$ of $X$ is a subset of $K_n$ for some $n$, the negation of the above lemma is equivalent to the following statement: there exists a compact set $K \subset X$ and $\varepsilon > 0$ such that for all $n \ge 1$ we have

$$\sup_{(x,a)\in K\times A} \int_{K_n^c} w(y)\,p(dy|x,a) \ge \varepsilon.$$

Note that $w$ is integrable with respect to the probability measures in the set $\{p(\cdot|x,a) : (x,a) \in K \times A\}$ since

$$\sup_{(x,a)\in K\times A} \int_X w(y)\,p(dy|x,a) \le \alpha \sup_{x\in K} w(x) < \infty.$$

For each $n$, we prove that $\int_{(\operatorname{int} K_n)^c} w(y)\,p(dy|x,a)$ is an upper semi-continuous function on $K \times A$. Recall that $\int_X w(y)\,p(dy|x,a)$ is a continuous function of $(x,a)$. Let $(x_k,a_k) \to (x,a)$ in $K \times A$. Then $p(\cdot|x_k,a_k) \to p(\cdot|x,a)$ weakly and $\int_X w(y)\,p(dy|x_k,a_k) \to \int_X w(y)\,p(dy|x,a)$ by our assumption. If we take $f_k = g_k = f = g = w$ in Serfozo [35, Theorem 3.3], this result implies that $\nu_k(\cdot) \to \nu(\cdot)$ weakly, where

$$\nu_k(D) = \int_D w(y)\,p(dy|x_k,a_k), \qquad \nu(D) = \int_D w(y)\,p(dy|x,a),$$

for all $D \in \mathcal{B}(X)$. Then, by Bartoszynski [2, Theorem A] we have

$$\int_{(\operatorname{int} K_n)^c} w(y)\,p(dy|x,a) := \nu((\operatorname{int} K_n)^c) \ge \limsup_{k\to\infty} \nu_k((\operatorname{int} K_n)^c) := \limsup_{k\to\infty} \int_{(\operatorname{int} K_n)^c} w(y)\,p(dy|x_k,a_k).$$

Hence, $\int_{(\operatorname{int} K_n)^c} w(y)\,p(dy|x,a)$ is upper semi-continuous. Since $K \times A$ is compact, there exists $(x_n,a_n) \in K \times A$ such that

$$\sup_{(x,a)\in K\times A} \int_{(\operatorname{int} K_n)^c} w(y)\,p(dy|x,a) = \int_{(\operatorname{int} K_n)^c} w(y)\,p(dy|x_n,a_n).$$

The sequence $\{(x_n,a_n)\}$ (being a sequence in a compact set $K \times A$) has a converging subsequence $\{(x_{n_k},a_{n_k})\}$ with the limit $(x,a) \in K \times A$. Then, for all $m \ge 2$, we have

$$\begin{aligned}
\int_{K_{m-1}^c} w(y)\,p(dy|x,a) &\ge \int_{(\operatorname{int} K_m)^c} w(y)\,p(dy|x,a) \\
&\ge \limsup_{k\to\infty} \int_{(\operatorname{int} K_m)^c} w(y)\,p(dy|x_{n_k},a_{n_k}) \\
&\ge \limsup_{k\to\infty} \int_{(\operatorname{int} K_{n_k})^c} w(y)\,p(dy|x_{n_k},a_{n_k}) \ge \varepsilon,
\end{aligned}$$

where the third inequality follows from the fact that $(\operatorname{int} K_m)^c \supset (\operatorname{int} K_{n_k})^c$ for $k$ sufficiently large. But this is a contradiction because $w$ is $p(\cdot|x,a)$ integrable. □

Let $\{\nu_n\}$ be a sequence of probability measures such that for each $n \ge 1$, $\nu_n \in \mathcal{P}(K_n^c)$ and

$$\gamma_n := \int_{K_n^c} w(x)\,\nu_n(dx) < \infty, \qquad (3.3)$$

$$\gamma = \sup_n \tau_n := \sup_n \max\Bigl\{0,\ \sup_{(x,a)\in X\times A} \int_{K_n^c} (\gamma_n - w(y))\,p(dy|x,a)\Bigr\} < \infty. \qquad (3.4)$$

For example, such probability measures can be constructed by choosing $x_n \in K_n^c$ such that $w(x_n) < \inf_{x\in K_n^c} w(x) + \frac{1}{n}$ and letting $\nu_n(\cdot) = \delta_{x_n}(\cdot)$.

Similar to the finite-state MDP construction in Section 2, we define a sequence of compact-state MDPs, denoted as c-MDP$_n$, to approximate the original model. To this end, for each $n$ let $X_n = K_n \cup \{\Delta_n\}$, where $\Delta_n \in K_n^c$ is a so-called pseudo-state. We define the transition probability $p_n$ on $X_n$ given $X_n \times A$ and the one-stage cost function $c_n : X_n \times A \to [0,\infty)$ by

$$p_n(\cdot|x,a) = \begin{cases} p(\cdot \cap K_n|x,a) + p(K_n^c|x,a)\,\delta_{\Delta_n}, & \text{if } x \in K_n \\ \int_{K_n^c} \bigl(p(\cdot \cap K_n|z,a) + p(K_n^c|z,a)\,\delta_{\Delta_n}\bigr)\,\nu_n(dz), & \text{if } x = \Delta_n, \end{cases}$$

$$c_n(x,a) = \begin{cases} c(x,a), & \text{if } x \in K_n \\ \int_{K_n^c} c(z,a)\,\nu_n(dz), & \text{if } x = \Delta_n. \end{cases}$$

With these definitions, c-MDP$_n$ is defined as a Markov decision process with the components $(X_n, A, p_n, c_n)$. History spaces, policies, and cost functions are defined in a similar way as in the original model. Let $\Pi_n$, $\Phi_n$, and $F_n$ denote the set of all policies, randomized stationary policies and deterministic stationary policies of c-MDP$_n$, respectively. For each policy $\pi \in \Pi_n$ and initial distribution $\mu \in \mathcal{P}(X_n)$, we denote the cost functions for c-MDP$_n$ by $J_n(\pi,\mu)$ and $V_n(\pi,\mu)$.
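The pseudo-state construction can be written out concretely when the state space is countable. The Python sketch below is an illustration under assumptions not in the paper: the finite index set {0, ..., S−1} stands in for the non-compact space, K_n is taken to be {0, ..., n−1}, and by default ν_n is the Dirac measure at state n. The function name `compact_model` and the random test kernel are hypothetical.

```python
import numpy as np

def compact_model(P, c, n, nu=None):
    """Build (p_n, c_n) of c-MDP_n on K_n + {Delta_n} from P[a, x, x'] and c[x, a].

    K_n = {0, ..., n-1}; index n of the result is the pseudo-state Delta_n.
    nu is a probability vector on K_n^c = {n, ..., S-1}; default: Dirac at state n.
    Requires 0 < n < S.
    """
    A, S, _ = P.shape
    if nu is None:
        nu = np.zeros(S - n)
        nu[0] = 1.0
    p_n = np.zeros((A, n + 1, n + 1))
    c_n = np.zeros((n + 1, c.shape[1]))
    # Rows for x in K_n: keep mass inside K_n, lump mass outside K_n onto Delta_n.
    p_n[:, :n, :n] = P[:, :n, :n]
    p_n[:, :n, n] = P[:, :n, n:].sum(axis=2)
    c_n[:n] = c[:n]
    # Row for Delta_n: average the same construction over z in K_n^c w.r.t. nu.
    p_n[:, n, :n] = np.einsum("z,azy->ay", nu, P[:, n:, :n])
    p_n[:, n, n] = np.einsum("z,az->a", nu, P[:, n:, n:].sum(axis=2))
    c_n[n] = nu @ c[n:]
    return p_n, c_n

# Hypothetical data: 2 actions, 30 states, a cost that grows with the state index.
rng = np.random.default_rng(0)
P = rng.random((2, 30, 30))
P /= P.sum(axis=2, keepdims=True)
c = np.arange(30, dtype=float)[:, None] + np.array([0.0, 1.0])[None, :]
p_n, c_n = compact_model(P, c, n=10)
```

Each row of `p_n` remains a probability vector because the mass that the original kernel places outside K_n is moved onto the pseudo-state rather than discarded.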

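To obtain the main result of this section, we introduce, for each $n$, another MDP, denoted by $\overline{\mathrm{MDP}}_n$, with the components $(X, A, q_n, b_n)$ where

$$q_n(\cdot|x,a) = \begin{cases} p(\cdot|x,a), & \text{if } x \in K_n \\ \int_{K_n^c} p(\cdot|z,a)\,\nu_n(dz), & \text{if } x \in K_n^c, \end{cases}$$

$$b_n(x,a) = \begin{cases} c(x,a), & \text{if } x \in K_n \\ \int_{K_n^c} c(z,a)\,\nu_n(dz), & \text{if } x \in K_n^c. \end{cases}$$

For each policy $\pi \in \Pi$ and initial distribution $\mu \in \mathcal{P}(X)$, we denote the cost functions for $\overline{\mathrm{MDP}}_n$ by $\bar J_n(\pi,\mu)$ and $\bar V_n(\pi,\mu)$.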

3.1. Discounted Cost. In this section we consider (Q1) for the discounted cost criterion with a discount factor $\beta \in (0,1)$. Throughout this section, it is assumed that Assumption 3.1 holds. The following result states that c-MDP$_n$ and $\overline{\mathrm{MDP}}_n$ are equivalent for the discounted cost.

Lemma 3.2. We have

$$\bar J_n^*(x) = \begin{cases} J_n^*(x), & \text{if } x \in K_n \\ J_n^*(\Delta_n), & \text{if } x \in K_n^c, \end{cases} \qquad (3.5)$$

where $\bar J_n^*$ is the discounted value function of $\overline{\mathrm{MDP}}_n$ and $J_n^*$ is the discounted value function of c-MDP$_n$, provided that there exist optimal deterministic stationary policies for $\overline{\mathrm{MDP}}_n$ and c-MDP$_n$. Furthermore, if, for any deterministic stationary policy $f \in F_n$, we define $\bar f(x) = f(x)$ on $K_n$ and $\bar f(x) = f(\Delta_n)$ on $K_n^c$, then

$$\bar J_n(\bar f,x) = \begin{cases} J_n(f,x), & \text{if } x \in K_n \\ J_n(f,\Delta_n), & \text{if } x \in K_n^c. \end{cases} \qquad (3.6)$$

In particular, if the deterministic stationary policy $f_n^* \in F_n$ is optimal for c-MDP$_n$, then its extension $\bar f_n^*$ to $X$ is also optimal for $\overline{\mathrm{MDP}}_n$.

Proof. The proof of (3.6) is a consequence of the following facts: $b_n(x,a) = b_n(y,a)$ and $q_n(\cdot|x,a) = q_n(\cdot|y,a)$ for all $x,y \in K_n^c$ and $a \in A$. In other words, $K_n^c$ in $\overline{\mathrm{MDP}}_n$ behaves like the pseudo-state $\Delta_n$ in c-MDP$_n$ when $\bar f$ is applied to $\overline{\mathrm{MDP}}_n$.

Let $\bar F_n$ denote the set of all deterministic stationary policies in $F$ which are obtained by extending policies in $F_n$ to $X$. If we can prove that $\min_{f\in F} \bar J_n(f,x) = \min_{f\in \bar F_n} \bar J_n(f,x)$ for all $x \in X$, then (3.5) follows from (3.6). Let $f \in F \setminus \bar F_n$. We have two cases: (i) $\bar J_n(f,z) = \bar J_n(f,y)$ for all $z,y \in K_n^c$ or (ii) there exist $z,y \in K_n^c$ such that $\bar J_n(f,z) < \bar J_n(f,y)$.

For the case (i), if we define the deterministic Markov policy $\pi^0$ as $\pi^0 = \{f_0, f, f, \ldots\}$, where $f_0(x) = f(z)$ on $K_n^c$ for some fixed $z \in K_n^c$ and $f_0(x) = f(x)$ on $K_n$, then using the expression

$$\bar J_n(\pi^0,x) = b_n(x,f_0(x)) + \beta \int_X \bar J_n(f,x')\,q_n(dx'|x,f_0(x)), \qquad (3.7)$$

it is straightforward to show that $\bar J_n(\pi^0,x) = \bar J_n(f,x)$ on $K_n$ and $\bar J_n(\pi^0,x) = \bar J_n(f,z)$ on $K_n^c$. Therefore, $\bar J_n(\pi^0,x) = \bar J_n(f,x)$ for all $x \in X$ since $\bar J_n(f,x) = \bar J_n(f,z)$ for all $x \in K_n^c$. For all $t \ge 1$ define













the deterministic Markov policy $\pi^t$ as $\pi^t = \{f_0, \pi^{t-1}\}$. Analogously, one can prove that $\bar J_n(\pi^t,x) = \bar J_n(\pi^{t+1},x)$ for all $x \in X$. Since $\bar J_n(\pi^t,x) \to \bar J_n(f_0,x)$ as $t \to \infty$, we have $\bar J_n(f_0,x) = \bar J_n(f,x)$ for all $x \in X$, where $f_0 \in \bar F_n$.

For the second case, if we again consider the deterministic Markov policy $\pi^0 = \{f_0, f, f, \ldots\}$, then by (3.7) we have $\bar J_n(\pi^0,y) = \bar J_n(f,z) < \bar J_n(f,y)$. Since $\min_{f\in F} \bar J_n(f,y) \le \bar J_n(\pi^0,y)$, this completes the proof. □

For each $n$, let us define $w_n$ by letting $w_n(x) = w(x)$ on $K_n$ and $w_n(x) = \int_{K_n^c} w(z)\,\nu_n(dz) =: \gamma_n$ on $K_n^c$. Hence, $w_n \in B(X)$ by (3.3).

Lemma 3.3. For all $n$ and $x \in X$, the components of $\overline{\mathrm{MDP}}_n$ satisfy the following:

$$\sup_{a\in A} |b_n(x,a)| \le M w_n(x), \qquad (3.8)$$

$$\sup_{a\in A} \int_X w_n(y)\,q_n(dy|x,a) \le \alpha w_n(x) + \gamma, \qquad (3.9)$$

where $\gamma$ is the constant in (3.4).

Proof. It is straightforward to prove (3.8) by using the definitions of $b_n$ and $w_n$, and the equation (3.1). To prove (3.9), we have to consider two cases: $x \in K_n$ and $x \in K_n^c$. For the first case, $q_n(\cdot|x,a) = p(\cdot|x,a)$, and therefore, we have

$$\begin{aligned}
\sup_{a\in A} \int_X w_n(y)\,p(dy|x,a) &= \sup_{a\in A} \Bigl( \int_X w(y)\,p(dy|x,a) + \int_{K_n^c} (\gamma_n - w(y))\,p(dy|x,a) \Bigr) \\
&\le \sup_{a\in A} \int_X w(y)\,p(dy|x,a) + \gamma \quad \text{(by (3.4))} \\
&\le \alpha w(x) + \gamma = \alpha w_n(x) + \gamma \quad \text{(as } w_n = w \text{ on } K_n\text{)}.
\end{aligned}$$

For $x \in K_n^c$, we have

$$\begin{aligned}
\sup_{a\in A} \int_X w_n(y)\,q_n(dy|x,a) &= \sup_{a\in A} \int_{K_n^c} \Bigl( \int_X w_n(y)\,p(dy|z,a) \Bigr)\,\nu_n(dz) \\
&\le \int_{K_n^c} \Bigl( \sup_{a\in A} \int_X w_n(y)\,p(dy|z,a) \Bigr)\,\nu_n(dz) \\
&\le \int_{K_n^c} (\alpha w(z) + \gamma)\,\nu_n(dz) \qquad (3.10) \\
&= \alpha w_n(x) + \gamma,
\end{aligned}$$

where (3.10) can be proved following the same arguments as for the case $x \in K_n$. This completes the proof. □

Note that if we define $c_{n,0}(x) = 1 + \sup_{a\in A} |b_n(x,a)|$ and $c_{n,t}(x) = \sup_{a\in A} \int_X c_{n,t-1}(y)\,q_n(dy|x,a)$, by (3.8) and (3.9), and an induction argument, we obtain (see Hernandez-Lerma and Lasserre [22, p. 46])

$$c_{n,t}(x) \le L w_n(x)\alpha^t + L\gamma \sum_{j=0}^{t-1} \alpha^j \quad \text{for all } x \in X, \qquad (3.11)$$

where $L = 1 + M$. Let $\beta_0 > \beta$ be such that $\alpha\beta_0 < 1$ and let $C_n : X \to [1,\infty)$ be defined by

$$C_n(x) = \sum_{t=0}^{\infty} \beta_0^t\,c_{n,t}(x).$$
Then, for all $x \in X$, by (3.11) we have

$$C_n(x) := \sum_{t=0}^{\infty} \beta_0^t\,c_{n,t}(x) \le \frac{L}{1-\beta_0\alpha}\,w_n(x) + \frac{L\beta_0}{(1-\beta_0)(1-\beta_0\alpha)}\,\gamma := L_1 w_n(x) + L_2. \qquad (3.12)$$
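For completeness, the constants in (3.12) can be checked directly from (3.11). The computation below is routine and is not part of the original text. It uses only $\alpha\beta_0 < 1$ and $\beta_0 < 1$ (the latter holds because $\alpha \ge 1$), and exchanges the order of summation of nonnegative terms:

```latex
\begin{aligned}
\sum_{t=0}^{\infty}\beta_0^{t}\,L\,w_n(x)\,\alpha^{t}
  &= \frac{L}{1-\beta_0\alpha}\,w_n(x),\\
\sum_{t=0}^{\infty}\beta_0^{t}\,L\gamma\sum_{j=0}^{t-1}\alpha^{j}
  &= L\gamma\sum_{j=0}^{\infty}\alpha^{j}\sum_{t=j+1}^{\infty}\beta_0^{t}
   = L\gamma\sum_{j=0}^{\infty}\frac{\alpha^{j}\beta_0^{j+1}}{1-\beta_0}
   = \frac{L\beta_0}{(1-\beta_0)(1-\beta_0\alpha)}\,\gamma .
\end{aligned}
```

Adding the two lines gives $L_1 = L/(1-\beta_0\alpha)$ and $L_2 = L\beta_0\gamma/((1-\beta_0)(1-\beta_0\alpha))$, neither of which depends on $n$.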


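Hence $C_n \in B(X)$ as $w_n \in B(X)$. Moreover, for all $(x,a) \in X \times A$, $C_n$ satisfies (see Hernandez-Lerma and Lasserre [22, p. 45])

$$\begin{aligned}
\int_X C_n(y)\,q_n(dy|x,a) &= \sum_{t=0}^{\infty} \beta_0^t \int_X c_{n,t}(y)\,q_n(dy|x,a) \\
&\le \sum_{t=0}^{\infty} \beta_0^t\,c_{n,t+1}(x) \\
&\le \frac{1}{\beta_0} \sum_{t=0}^{\infty} \beta_0^t\,c_{n,t}(x) = \alpha_0 C_n(x),
\end{aligned}$$

where $\alpha_0 := \frac{1}{\beta_0}$ and $\alpha_0\beta < 1$ since $\beta_0 > \beta$. Therefore, for all $x \in X$, components of $\overline{\mathrm{MDP}}_n$ satisfy

$$\sup_{a\in A} |b_n(x,a)| \le C_n(x), \qquad (3.13)$$

$$\sup_{a\in A} \int_X C_n(y)\,q_n(dy|x,a) \le \alpha_0 C_n(x). \qquad (3.14)$$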

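Since $C_n \in B(X)$, the Bellman optimality operator $\bar T_n$ of $\overline{\mathrm{MDP}}_n$ maps $B(X)$ into $B(X)$ and is given by

$$\begin{aligned}
\bar T_n u(x) &= \inf_{a\in A} \Bigl[ b_n(x,a) + \beta \int_X u(y)\,q_n(dy|x,a) \Bigr] \\
&= \begin{cases} \inf_{a\in A} \bigl[ c(x,a) + \beta \int_X u(y)\,p(dy|x,a) \bigr], & \text{if } x \in K_n \\ \inf_{a\in A} \int_{K_n^c} \bigl[ c(z,a) + \beta \int_X u(y)\,p(dy|z,a) \bigr]\,\nu_n(dz), & \text{if } x \in K_n^c. \end{cases}
\end{aligned}$$

Then successive approximations to the discounted value function of $\overline{\mathrm{MDP}}_n$ are given by $v_n^0 = 0$ and $v_n^{t+1} = \bar T_n v_n^t$ ($t \ge 1$). Since $\alpha_0\beta < 1$, it can be proved as in Hernandez-Lerma and Lasserre [22, Theorem 8.3.6, p. 47] and Hernandez-Lerma and Lasserre [22, (8.3.34), p. 52] that

$$|v_n^t(x)|,\ |\bar J_n^*(x)| \le \frac{C_n(x)}{1-\sigma_0} \quad \text{for all } x, \qquad (3.15)$$

$$\|v_n^t - \bar J_n^*\|_{C_n} \le \frac{\sigma_0^t}{1-\sigma_0}, \qquad (3.16)$$

where $\sigma_0 = \beta\alpha_0 < 1$.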

Similar to $v_n^t$, let us define $v^0 = 0$ and $v^{t+1} = T v^t$, where $T : B_w(X) \to B_w(X)$, the Bellman optimality operator for the original MDP, is given by

$$T u(x) = \inf_{a\in A} \Bigl[ c(x,a) + \beta \int_X u(y)\,p(dy|x,a) \Bigr].$$

Then, again by Hernandez-Lerma and Lasserre [22, Theorem 8.3.6, p. 47] and Hernandez-Lerma and Lasserre [22, (8.3.34), p. 52] we have

$$|v^t(x)|,\ |J^*(x)| \le \frac{M w(x)}{1-\sigma} \quad \text{for all } x, \qquad (3.17)$$

$$\|v^t - J^*\|_w \le \frac{M\sigma^t}{1-\sigma}, \qquad (3.18)$$

where $\sigma = \beta\alpha < 1$.
Lemma 3.4. For any compact set $K \subset X$, we have

$$\lim_{n\to\infty} \sup_{x\in K} |v_n^t(x) - v^t(x)| = 0 \qquad (3.19)$$

for all $t \ge 1$.

Proof. We prove (3.19) by induction on $t$. For $t = 1$, the claim trivially holds since any compact set $K \subset X$ is inside $K_n$ for sufficiently large $n$, and therefore, $b_n = c$ on $K$ for sufficiently large $n$ (recall $v_n^0 = v^0 = 0$). Assume the claim is true for $t \ge 1$. Fix any compact set $K$. Recall the definition of compact subsets $K_\varepsilon$ of $X$ in Lemma 3.1. By definition of $q_n$, $b_n$, and $w_n$, there exists $n_0 \ge 1$ such that for all $n \ge n_0$, $q_n = p$, $b_n = c$, and $w_n = w$ on $K$. With these observations, for each $n \ge n_0$ we have

$$\begin{aligned}
\sup_{x\in K} |v_n^{t+1}(x) - v^{t+1}(x)|
&= \sup_{x\in K} \Bigl| \inf_{a\in A} \Bigl[ c(x,a) + \beta \int_X v_n^t(y)\,p(dy|x,a) \Bigr] - \inf_{a\in A} \Bigl[ c(x,a) + \beta \int_X v^t(y)\,p(dy|x,a) \Bigr] \Bigr| \\
&\le \beta \sup_{(x,a)\in K\times A} \Bigl| \int_X v_n^t(y)\,p(dy|x,a) - \int_X v^t(y)\,p(dy|x,a) \Bigr| \\
&= \beta \sup_{(x,a)\in K\times A} \Bigl| \int_{K_\varepsilon} (v_n^t(y) - v^t(y))\,p(dy|x,a) + \int_{K_\varepsilon^c} (v_n^t(y) - v^t(y))\,p(dy|x,a) \Bigr| \\
&\le \beta \Bigl\{ \sup_{x\in K_\varepsilon} |v_n^t(x) - v^t(x)| + \sup_{(x,a)\in K\times A} \Bigl| \int_{K_\varepsilon^c} (v_n^t(y) - v^t(y))\,p(dy|x,a) \Bigr| \Bigr\}.
\end{aligned}$$

Note that we have $|v^t| \le \frac{M w}{1-\sigma}$ by (3.17). Since $w_n \le \gamma_{\max} w$, where $\gamma_{\max} := \max\{1,\gamma\}$, we also have $|v_n^t| \le \frac{L_1 w_n + L_2}{1-\sigma_0} \le \frac{(L_1\gamma_{\max} + L_2)\,w}{1-\sigma_0}$ by (3.12) and (3.15) (as $w \ge 1$). Let us define

$$R := \frac{L_1\gamma_{\max} + L_2}{1-\sigma_0} + \frac{M}{1-\sigma}.$$

Then by Lemma 3.1 we have

$$\sup_{x\in K} |v_n^{t+1}(x) - v^{t+1}(x)| \le \beta \sup_{x\in K_\varepsilon} |v_n^t(x) - v^t(x)| + \beta R \varepsilon.$$

Since the first term converges to zero as $n \to \infty$ by the induction hypothesis, and $\varepsilon$ is arbitrary, the claim is true for $t + 1$. This completes the proof. □

The following theorem states that the discounted value function of $\overline{\mathrm{MDP}}_n$ converges to the discounted value function of the original MDP uniformly on each compact set $K \subset X$.

Theorem 3.1. For any compact set $K \subset X$ we have

$$\lim_{n\to\infty} \sup_{x\in K} |\bar J_n^*(x) - J^*(x)| = 0. \qquad (3.20)$$

Proof. Fix any compact set $K \subset X$. Since $w$ is continuous and therefore bounded on $K$, it is sufficient to prove $\lim_{n\to\infty} \sup_{x\in K} \frac{|\bar J_n^*(x) - J^*(x)|}{w(x)} = 0$. Let $n$ be chosen such that $K \subset K_n$, and so, $w_n = w$ on $K$. Then we have

$$\begin{aligned}
\sup_{x\in K} \frac{|\bar J_n^*(x) - J^*(x)|}{w(x)}
&\le \sup_{x\in K} \frac{|\bar J_n^*(x) - v_n^t(x)|}{w(x)} + \sup_{x\in K} \frac{|v_n^t(x) - v^t(x)|}{w(x)} + \sup_{x\in K} \frac{|v^t(x) - J^*(x)|}{w(x)} \\
&\le \sup_{x\in K} \frac{|\bar J_n^*(x) - v_n^t(x)|}{C_n(x)}\,\frac{C_n(x)}{w(x)} + \sup_{x\in K} \frac{|v_n^t(x) - v^t(x)|}{w(x)} + \frac{M\sigma^t}{1-\sigma} \quad \text{(by (3.18))} \\
&\le \sup_{x\in K} \frac{|\bar J_n^*(x) - v_n^t(x)|}{C_n(x)}\,\frac{L_1 w_n(x) + L_2}{w(x)} + \sup_{x\in K} \frac{|v_n^t(x) - v^t(x)|}{w(x)} + \frac{M\sigma^t}{1-\sigma} \quad \text{(by (3.12))} \\
&\le (L_1 + L_2) \sup_{x\in K} \frac{|\bar J_n^*(x) - v_n^t(x)|}{C_n(x)} + \sup_{x\in K} \frac{|v_n^t(x) - v^t(x)|}{w(x)} + \frac{M\sigma^t}{1-\sigma} \quad (w_n = w \text{ on } K) \\
&\le (L_1 + L_2)\,\frac{\sigma_0^t}{1-\sigma_0} + \sup_{x\in K} \frac{|v_n^t(x) - v^t(x)|}{w(x)} + \frac{M\sigma^t}{1-\sigma} \quad \text{(by (3.16))}.
\end{aligned}$$

Since $w \ge 1$ on $X$, $\sup_{x\in K} \frac{|v_n^t(x) - v^t(x)|}{w(x)} \to 0$ as $n \to \infty$ for all $t$ by Lemma 3.4. Hence, the last expression can be made arbitrarily small. This completes the proof. □

In the remainder of this section, we use the above results and Theorem 2.2 to compute a near optimal policy for the original MDP. It is straightforward to check that for each $n$, c-MDP$_n$ satisfies the assumptions in Theorem 2.2. Let $\{\varepsilon_n\}$ be a sequence of positive real numbers such that $\lim_{n\to\infty} \varepsilon_n = 0$.

By Theorem 2.2, for each $n \ge 1$, there exists a deterministic stationary policy $f_n \in F_n$, obtained from the finite state approximations of c-MDP$_n$, such that

$$\sup_{x\in X_n} |J_n(f_n,x) - J_n^*(x)| \le \varepsilon_n,$$

where for each $n$, finite-state models are constructed replacing $(Z, A, p, c)$ with the components $(X_n, A, p_n, c_n)$ of c-MDP$_n$ in Section 2. By Lemma 3.2, for each $n \ge 1$ we also have

$$\sup_{x\in X} |\bar J_n(f_n,x) - \bar J_n^*(x)| \le \varepsilon_n, \qquad (3.21)$$

where, with an abuse of notation, we also denote the extended (to $X$) policy by $f_n$. Let us define operators $\bar R_n : B_{C_n}(X) \to B_{C_n}(X)$ and $R_n : B_w(X) \to B_w(X)$ by

$$\bar R_n u(x) = \begin{cases} c(x,f_n(x)) + \beta \int_X u(y)\,p(dy|x,f_n(x)), & \text{if } x \in K_n \\ \int_{K_n^c} \bigl[ c(z,f_n(z)) + \beta \int_X u(y)\,p(dy|z,f_n(z)) \bigr]\,\nu_n(dz), & \text{if } x \in K_n^c, \end{cases}$$

$$R_n u(x) = c(x,f_n(x)) + \beta \int_X u(y)\,p(dy|x,f_n(x)).$$

By Hernandez-Lerma and Lasserre [22, Remark 8.3.10, p. 54], $\bar R_n$ is a contraction operator with modulus $\sigma_0$ and $R_n$ is a contraction operator with modulus $\sigma$. Furthermore, the fixed point of $\bar R_n$ is $\bar J_n(f_n,x)$ and the fixed point of $R_n$ is $J(f_n,x)$. For each $n \ge 1$, let us define $\bar u_n^0 = u_n^0 = 0$ and $\bar u_n^{t+1} = \bar R_n \bar u_n^t$, $u_n^{t+1} = R_n u_n^t$ ($t \ge 1$). One can prove that (see the proof of Hernandez-Lerma and Lasserre [22, Theorem 8.3.6, p. 51])

$$|\bar u_n^t(x)|,\ |\bar J_n(f_n,x)| \le \frac{C_n(x)}{1-\sigma_0}, \qquad \|\bar u_n^t - \bar J_n(f_n,\cdot)\|_{C_n} \le \frac{\sigma_0^t}{1-\sigma_0},$$

$$|u_n^t(x)|,\ |J(f_n,x)| \le \frac{M w(x)}{1-\sigma}, \qquad \|u_n^t - J(f_n,\cdot)\|_w \le \frac{M\sigma^t}{1-\sigma}.$$

Lemma 3.5. For any compact set $K \subset X$, we have

$$\lim_{n\to\infty} \sup_{x\in K} |\bar u_n^t(x) - u_n^t(x)| = 0.$$

Proof. The lemma can be proved using the same arguments as in the proof of Lemma 3.4 and so we omit the details. □

Lemma 3.6. For any compact set $K \subset X$, we have

$$\lim_{n\to\infty} \sup_{x\in K} |\bar J_n(f_n,x) - J(f_n,x)| = 0. \qquad (3.22)$$

Indeed, this is true for all sequences of policies in $F$.

Proof. The lemma can be proved using the same arguments as in the proof of Theorem 3.1. □

The following theorem is the main result of this section. It states that the true cost functions of the policies obtained from finite state models converge to the value function of the original MDP. Hence, to obtain a near optimal policy for the original MDP, it is sufficient to compute the optimal policy for a finite state model that has a sufficiently large number of grid points.

Theorem 3.2. For any compact set $K \subset X$, we have

$$\lim_{n\to\infty} \sup_{x\in K} |J(f_n,x) - J^*(x)| = 0.$$

Therefore,

$$\lim_{n\to\infty} |J(f_n,x) - J^*(x)| = 0 \quad \text{for all } x \in X.$$

Proof. The result follows from (3.20), (3.21), and (3.22). □

3.2. Average Cost. In this section we obtain approximation results, analogous to Theorems 3.1 and 3.2, for the average cost criterion. To do this, we impose some new assumptions on the components of the original MDP in addition to Assumption 3.1. These assumptions are the unbounded counterpart of Assumption 2.2. With the exception of Assumption 3.2-(j), versions of these assumptions were imposed in Vega-Amaya [38], Gordienko and Hernandez-Lerma [18], and Jaskiewicz and Nowak [26] to study the existence of the solution to the Average Cost Optimality Equality (ACOE) and Inequality (ACOI). In what follows, for any finite signed measure $\vartheta$ and measurable function $h$ on $X$, we let $\vartheta(h) := \int_X h(x)\,\vartheta(dx)$ and

$$\|\vartheta\|_w := \sup_{\|g\|_w \le 1} \Bigl| \int_X g(x)\,\vartheta(dx) \Bigr|.$$

Here $\|\vartheta\|_w$ is called the $w$-norm of $\vartheta$.

Assumption 3.2. Suppose Assumption 3.1 holds with item (b) and (3.2) replaced by conditions (j) and (e) below, respectively. In addition, there exist a probability measure $\eta$ on $X$ and a positive measurable function $\phi : X \times A \to (0,\infty)$ such that for all $(x,a) \in X \times A$

(e) $\int_X w(y)\,p(dy|x,a) \le \alpha w(x) + \eta(w)\phi(x,a)$, where $\alpha \in (0,1)$.

(f) $p(D|x,a) \ge \eta(D)\phi(x,a)$ for all $D \in \mathcal{B}(X)$.

(g) The weight function $w$ is $\eta$-integrable, i.e., $\eta(w) < \infty$.

(h) For each $n \ge 1$, $\inf_{(x,a)\in K_n\times A} \phi(x,a) > 0$.

(j) The stochastic kernel $p(\cdot|x,a)$ is continuous in $(x,a)$ with respect to the $w$-norm.

Throughout this section, it is assumed that Assumption 3.2 holds. Conditions (e), (f), and (g) 
of Assumption 3.2 are unbounded counterparts of conditions (d) and (e) in Assumption 2.2. Recall 
that condition (e) corresponds to the so-called ‘drift inequality’ and condition (f) corresponds to 
the so-called ‘minorization’ condition which guarantee the geometric ergodicity of Markov chains 
induced by stationary policies (see Hernandez-Lerma and Lasserre [22], Meyn and Tweedie [29] 
and references therein). These assumptions are quite general for studying average cost problems 
with unbounded one-stage costs. In addition, they are proper for the approximation problem in 
the sense that it can be shown that if the original problem satisfies these, then the reduced models 
constructed in the sequel satisfy similar conditions. There is only one minor difference between
Assumption 3.2-(f) and the standard minorization condition: in the literature $\phi$ is in general required
to be nonnegative instead of positive.
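To make the role of the minorization condition concrete, the following minimal sketch checks it on a hypothetical three-state chain (the matrix `P` and all names are illustrative assumptions, not taken from the paper). On a finite state space the weight function can be taken identically one, so the drift inequality is trivial and only the minorization constant matters; the classical Doeblin bound then gives geometric convergence to the invariant distribution at rate `1 - eps`.

```python
# Hypothetical 3-state transition matrix (illustrative numbers only).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]


def step(dist, P):
    """One step of the chain: row vector times transition matrix."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def tv(p, q):
    """Total variation distance, here taken as half the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Minorization: P[i][j] >= eps * eta[j] for all i, with eta a probability vector.
col_min = [min(P[i][j] for i in range(3)) for j in range(3)]
eps = sum(col_min)
eta = [m / eps for m in col_min]

# Approximate the invariant distribution by iterating the chain.
mu = [1.0, 0.0, 0.0]
for _ in range(200):
    mu = step(mu, P)


def tv_to_invariant(start, t):
    """Distance between the t-step law started from `start` and mu."""
    d = start
    for _ in range(t):
        d = step(d, P)
    return tv(d, mu)
```

Under these assumptions, `tv_to_invariant(start, t)` should stay below `(1 - eps) ** t` for every starting distribution, which is the finite-state analogue of a bound of the form $R\,w(x)\,\kappa^t$.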

Note that although Assumption 3.2-(j) seems to be restrictive, it is weaker than the assumptions
imposed in the literature for studying approximation of average cost problems with unbounded
cost (see Dufour and Prieto-Rumeau [15]). Indeed, it is assumed in Dufour and Prieto-Rumeau
[15] that the transition probability $p$ is Lipschitz continuous in $(x,a)$ with respect to the $w$-norm. The
reason for imposing such a strong condition on the transition probability is to obtain a convergence
rate for the approximation problem. Since we do not aim to provide a rate of convergence result
in this section, it is natural to impose continuity instead of Lipschitz continuity of the transition
probability. However, it does not seem possible to replace continuity with respect to the $w$-norm by
a weaker convergence notion. One reason is that with a weaker continuity notion it is not possible
to prove that the transition probability of c-MDP$_n$ is continuous with respect to the total variation
distance, which is needed if one wants to use Theorem 2.6 and cannot be relaxed as explained in
Remark 2.3.

Analogous to Theorem 2.5, the following theorem is a consequence of Vega-Amaya [38, Theorem 3.3],
Gordienko and Hernandez-Lerma [18, Lemma 3.4] (see also Hernandez-Lerma and
Lasserre [22, Proposition 10.2.5, p. 126]), and Jaskiewicz and Nowak [26, Theorem 3], which also
holds with Assumption 3.2-(j) replaced by Assumption 3.1-(b).

Theorem 3.3. For each $f \in \mathbb{F}$, the stochastic kernel $p(\,\cdot\,|x,f(x))$ is positive Harris recurrent
with unique invariant probability measure $\mu_f$. Furthermore, $w$ is $\mu_f$-integrable, and therefore,
$\rho_f := \int_X c(x,f(x))\,\mu_f(dx) < \infty$. There exist positive real numbers $R$ and $\kappa < 1$ such that

$$\sup_{f \in \mathbb{F}} \|p^t(\,\cdot\,|x,f(x)) - \mu_f\|_w \le R\,w(x)\,\kappa^t \tag{3.23}$$

for all $x \in X$, where $R$ and $\kappa$ continuously depend on $\alpha$, $\eta(w)$, and $\inf_{f \in \mathbb{F}} \eta(\phi(y,f(y)))$. Finally, there
exist $f^* \in \mathbb{F}$ and $h^* \in B_w(X)$ such that the triplet $(h^*, f^*, \rho_{f^*})$ satisfies the average cost optimality
equality (ACOE), and therefore,

$$\inf_{\pi \in \Pi} V(\pi,x) =: V^*(x) = \rho_{f^*},$$

for all $x \in X$.

Note that (3.23) implies that for each $f \in \mathbb{F}$, the average cost is given by
$V(f,x) = \int_X c(y,f(y))\,\mu_f(dy)$ for all $x \in X$ (instead of $\mu_f$-a.e.); that is, the average cost is independent of the
initial point.

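Recall that $V_n$ and $\bar{V}_n$ denote the average costs of c-MDP$_n$ and $\overline{\text{MDP}}_n$, respectively. The
value functions for average cost are denoted analogously to the discounted cost case. Similar to
Lemma 3.2, the following result states that c-MDP$_n$ and $\overline{\text{MDP}}_n$ are not too different for the average
cost.

Lemma 3.7. Suppose Theorem 3.3 holds for $\overline{\text{MDP}}_n$ and Theorem 2.5 holds for c-MDP$_n$. Then
we have

$$\bar{V}_n^*(x) = \begin{cases} V_n^*(x), & \text{if } x \in K_n \\ V_n^*(\Delta_n), & \text{if } x \in K_n^c. \end{cases} \tag{3.24}$$

Furthermore, if, for any deterministic stationary policy $f \in \mathbb{F}_n$, we define $\bar{f}(x) = f(x)$ on $K_n$ and
$\bar{f}(x) = f(\Delta_n)$ on $K_n^c$, then

$$\bar{V}_n(\bar{f},x) = \begin{cases} V_n(f,x), & \text{if } x \in K_n \\ V_n(f,\Delta_n), & \text{if } x \in K_n^c. \end{cases} \tag{3.25}$$

In particular, if the deterministic stationary policy $f^* \in \mathbb{F}_n$ is optimal for c-MDP$_n$, then its extension
$\bar{f}^*$ to $X$ is also optimal for $\overline{\text{MDP}}_n$.

Proof. Let the triplet $(h_n^*, f_n^*, \rho^n_{f_n^*})$ satisfy the ACOE for c-MDP$_n$, so that $f_n^*$ is an optimal policy
and $\rho^n_{f_n^*}$ is the average value function for c-MDP$_n$. It is straightforward to show that the triplet
$(\bar{h}_n^*, \bar{f}_n^*, \rho^n_{f_n^*})$ satisfies the ACOE for $\overline{\text{MDP}}_n$, where

$$\bar{h}_n^*(x) = \begin{cases} h_n^*(x), & \text{if } x \in K_n \\ h_n^*(\Delta_n), & \text{if } x \in K_n^c \end{cases}
\qquad \text{and} \qquad
\bar{f}_n^*(x) = \begin{cases} f_n^*(x), & \text{if } x \in K_n \\ f_n^*(\Delta_n), & \text{if } x \in K_n^c. \end{cases}$$

By Gordienko and Hernandez-Lerma [18, Lemma 5.2] (see also Hernandez-Lerma and Lasserre [21,
Section 5.2]), this implies that $\bar{f}_n^*$ is an optimal stationary policy for $\overline{\text{MDP}}_n$ with cost function $\rho^n_{f_n^*}$.
This completes the proof of the first part.

For the second part, let $f \in \mathbb{F}_n$ with a unique invariant probability measure $\mu_f \in \mathcal{P}(X_n)$ and let
$\bar{f} \in \mathbb{F}$ denote its extension to $X$ with a unique invariant probability measure $\bar{\mu}_{\bar{f}}$. It can be proved
that

$$\mu_f(\,\cdot\,) = \bar{\mu}_{\bar{f}}(\,\cdot \cap K_n) + \bar{\mu}_{\bar{f}}(K_n^c)\,\delta_{\Delta_n}(\,\cdot\,).$$

Then we have

$$\begin{aligned}
\bar{V}_n(\bar{f},x) &= \int_X b_n(x,\bar{f}(x))\,\bar{\mu}_{\bar{f}}(dx) \\
&= \int_{K_n} c_n(x,f(x))\,\bar{\mu}_{\bar{f}}(dx) + \bar{\mu}_{\bar{f}}(K_n^c)\,c_n(\Delta_n,f(\Delta_n)) \\
&= \int_{X_n} c_n(x,f(x))\,\mu_f(dx) \\
&= V_n(f,x).
\end{aligned}$$

This completes the proof. □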

By Lemma 3.7, in the remainder of this section we need only consider $\overline{\text{MDP}}_n$ in place of c-MDP$_n$.
Later we will show that Theorem 3.3 holds for $\overline{\text{MDP}}_n$ for $n$ sufficiently large and that Theorem 2.5
holds for c-MDP$_n$ for all $n$.

Recall the definition of constants $\gamma_n$ and $\tau_n$ from (3.3) and (3.4). For each $n \ge 1$, we define
$\phi_n : X \times A \to (0, \infty)$ and $\varsigma_n \in \mathbb{R}$ as

$$\phi_n(x,a) := \begin{cases} \phi(x,a), & \text{if } x \in K_n \\ \int_{K_n^c} \phi(y,a)\,\nu_n(dy), & \text{if } x \in K_n^c, \end{cases}
\qquad
\varsigma_n := \int_{K_n^c} w(y)\,\eta(dy).$$

Since $\eta(w) < \infty$ and $\tau_n$ can be made arbitrarily small by properly choosing $\nu_n$, we assume, without
loss of generality, the following.

Assumption 3.3. The sequence of probability measures $\{\nu_n\}$ is chosen such that the following
holds:

$$\lim_{n\to\infty} (\tau_n + \varsigma_n) = 0. \tag{3.26}$$

Let $\alpha_n := \alpha + \varsigma_n + \tau_n$.

Lemma 3.8. For all $n$ and $(x,a) \in X \times A$, the components of $\overline{\text{MDP}}_n$ satisfy the following:

$$\begin{aligned}
\sup_{a \in A} |b_n(x,a)| &\le M w_n(x), \\
\int_X w_n(y)\,q_n(dy|x,a) &\le \alpha_n w_n(x) + \eta(w_n)\phi_n(x,a), \\
q_n(D|x,a) &\ge \eta(D)\phi_n(x,a) \quad \text{for all } D \in \mathcal{B}(X).
\end{aligned} \tag{3.27}$$

Proof. The proof of the first inequality follows from Assumption 3.2 and the definitions of $b_n$ and
$w_n$. To prove the remaining two inequalities, we have to consider the cases $x \in K_n$ and $x \in K_n^c$
separately.

Let $x \in K_n$, and therefore, $q_n(\,\cdot\,|x,a) = p(\,\cdot\,|x,a)$. The second inequality holds since

$$\begin{aligned}
\int_X w_n(y)\,p(dy|x,a) &= \int_X w(y)\,p(dy|x,a) + \int_{K_n^c} (\gamma_n - w(y))\,p(dy|x,a) \\
&\le \int_X w(y)\,p(dy|x,a) + \tau_n \\
&\le \alpha w(x) + \eta(w)\phi(x,a) + \tau_n \\
&\le \alpha w_n(x) + \eta(w_n)\phi_n(x,a) + \varsigma_n \phi_n(x,a) + \tau_n \quad (\text{as } w_n = w \text{ and } \phi_n = \phi \text{ on } K_n) \\
&\le \alpha_n w_n(x) + \eta(w_n)\phi_n(x,a) \quad (\text{as } \phi_n \le 1 \text{ and } w_n \ge 1).
\end{aligned}$$

For the last inequality, for all $D \in \mathcal{B}(X)$, we have

$$q_n(D|x,a) = p(D|x,a) \ge \eta(D)\phi(x,a) = \eta(D)\phi_n(x,a) \quad (\text{as } \phi_n = \phi \text{ on } K_n).$$

Hence, the inequalities hold for $x \in K_n$.
For $x \in K_n^c$, we have

$$\begin{aligned}
\int_X w_n(y)\,q_n(dy|x,a) &= \int_{K_n^c} \Bigl( \int_X w_n(y)\,p(dy|z,a) \Bigr) \nu_n(dz) \\
&\le \int_{K_n^c} \bigl( \alpha w(z) + \eta(w_n)\phi(z,a) + \varsigma_n \phi(z,a) + \tau_n \bigr)\,\nu_n(dz) \qquad (3.28) \\
&= \alpha w_n(x) + \eta(w_n)\phi_n(x,a) + \varsigma_n \phi_n(x,a) + \tau_n \\
&\le \alpha_n w_n(x) + \eta(w_n)\phi_n(x,a) \quad (\text{since } \phi_n \le 1 \text{ and } w_n \ge 1),
\end{aligned}$$

where (3.28) can be obtained following the same arguments as for the case $x \in K_n$. The last
inequality holds for $x \in K_n^c$ since

$$q_n(D|x,a) = \int_{K_n^c} p(D|z,a)\,\nu_n(dz) \ge \int_{K_n^c} \eta(D)\phi(z,a)\,\nu_n(dz) = \eta(D)\phi_n(x,a).$$

This completes the proof. □

We note that by (3.26), there exists $n_0 \ge 1$ such that $\alpha_n < 1$ for $n \ge n_0$. Hence, for each $n \ge n_0$,
Theorem 3.3 holds for $\overline{\text{MDP}}_n$ with $w$ replaced by $w_n$ for some $R_n > 0$ and $\kappa_n \in (0,1)$, and we have
$R_{\max} := \sup_{n \ge n_0} R_n < \infty$ and $\kappa_{\max} := \sup_{n \ge n_0} \kappa_n < 1$.

In the remainder of this section, it is assumed that $n \ge n_0$.

Lemma 3.9. Let $g : X \times A \to \mathbb{R}$ be any measurable function such that $\sup_{a \in A} |g(x,a)| \le M_g w(x)$
for some $M_g \in \mathbb{R}$. Then, for all $t \ge 1$ and any compact set $K \subset X$ we have

$$\sup_{(y,f) \in K \times \mathbb{F}} \Bigl| \int_X g_n(x,f(x))\,q_n^t(dx|y,f(y)) - \int_X g(x,f(x))\,p^t(dx|y,f(y)) \Bigr| \to 0$$

as $n \to \infty$, where $g_n(x,a) = g(x,a)$ on $K_n \times A$ and $g_n(x,a) = \int_{K_n^c} g(z,a)\,\nu_n(dz)$ on $K_n^c \times A$.

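Proof. We will prove the lemma by induction. Fix any compact set $K \subset X$. We note that in
the inequalities below, we repeatedly use the fact $\phi, \phi_n \le 1$ without explicitly referring to this
fact. Recall the definition of the compact subsets $K_\varepsilon$ of $X$ in Lemma 3.1 and the constant
$\gamma_{\max} = \max\{1, \gamma\}$. Note that $\sup_{a \in A} |g_n(x,a)| \le M_g w_n(x) \le M_g \gamma_{\max} w(x)$ for all $x \in X$.

The claim holds for $t = 1$ by the following argument:

$$\begin{aligned}
&\sup_{(y,f) \in K \times \mathbb{F}} \Bigl| \int_X g_n(x,f(x))\,q_n(dx|y,f(y)) - \int_X g(x,f(x))\,p(dx|y,f(y)) \Bigr| \\
&\quad = \sup_{(y,f) \in K \times \mathbb{F}} \Bigl| \int_X g_n(x,f(x))\,p(dx|y,f(y)) - \int_X g(x,f(x))\,p(dx|y,f(y)) \Bigr| \quad (\text{for } n \text{ sufficiently large}) \\
&\quad = \sup_{(y,f) \in K \times \mathbb{F}} \Bigl| \int_{K_n^c} g_n(x,f(x))\,p(dx|y,f(y)) - \int_{K_n^c} g(x,f(x))\,p(dx|y,f(y)) \Bigr| \\
&\quad \le M_g (1 + \gamma_{\max})\,\varepsilon \quad (\text{for } n \text{ sufficiently large}),
\end{aligned}$$

where the last inequality follows from Lemma 3.1. Since $\varepsilon$ is arbitrary, the result follows.

Assume the claim is true for $t \ge 1$. Let us define $l_f(z) := \int_X g(x,f(x))\,p^t(dx|z,f(z))$ and
$l_f^n(z) := \int_X g_n(x,f(x))\,q_n^t(dx|z,f(z))$. By recursively applying the inequalities in Assumption 3.2-(e) and in
(3.27) we obtain

$$\sup_{f \in \mathbb{F}} |l_f(z)| \le M_g \alpha^t w(z) + M_g \eta(w) \sum_{j=0}^{t-1} \alpha^j$$

and

$$\sup_{f \in \mathbb{F}} |l_f^n(z)| \le M_g \alpha_n^t w_n(z) + M_g \eta(w_n) \sum_{j=0}^{t-1} \alpha_n^j
\le M_g \alpha_{\max}^t \gamma_{\max} w(z) + M_g \eta(w) \gamma_{\max} \sum_{j=0}^{t-1} \alpha_{\max}^j,$$

where $\alpha_{\max} := \sup_{n \ge n_0} \alpha_n < 1$. Then we have

$$\begin{aligned}
&\sup_{(y,f) \in K \times \mathbb{F}} \Bigl| \int_X g_n(x,f(x))\,q_n^{t+1}(dx|y,f(y)) - \int_X g(x,f(x))\,p^{t+1}(dx|y,f(y)) \Bigr| \\
&\quad = \sup_{(y,f) \in K \times \mathbb{F}} \Bigl| \int_X l_f^n(z)\,q_n(dz|y,f(y)) - \int_X l_f(z)\,p(dz|y,f(y)) \Bigr| \\
&\quad = \sup_{(y,f) \in K \times \mathbb{F}} \Bigl| \int_X l_f^n(z)\,p(dz|y,f(y)) - \int_X l_f(z)\,p(dz|y,f(y)) \Bigr| \quad (\text{for } n \text{ sufficiently large}) \\
&\quad \le \sup_{(y,f) \in K \times \mathbb{F}} \Bigl| \int_{K_\varepsilon^c} l_f^n(z)\,p(dz|y,f(y)) - \int_{K_\varepsilon^c} l_f(z)\,p(dz|y,f(y)) \Bigr| + \sup_{(z,f) \in K_\varepsilon \times \mathbb{F}} |l_f^n(z) - l_f(z)| \\
&\quad \le R\,\varepsilon + \sup_{(z,f) \in K_\varepsilon \times \mathbb{F}} |l_f^n(z) - l_f(z)|,
\end{aligned} \tag{3.29}$$

where $R$ is given by

$$R := M_g \Bigl( \alpha^t + \alpha_{\max}^t \gamma_{\max} + \eta(w) \sum_{j=0}^{t-1} \alpha^j + \eta(w) \gamma_{\max} \sum_{j=0}^{t-1} \alpha_{\max}^j \Bigr)$$

and the last inequality follows from Lemma 3.1. Since the claim holds for $t$ and $K_\varepsilon$, the second
term in (3.29) goes to zero as $n \to \infty$. Since $\varepsilon$ is arbitrary, the result follows. □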

In the remainder of this section the above results are used to compute a near optimal policy for
the original MDP. Let $\{\varepsilon_n\}$ be a sequence of positive real numbers converging to zero.

For each $f \in \mathbb{F}$, let $\mu_f^n$ denote the unique invariant probability measure of the transition
kernel $q_n(\,\cdot\,|x,f(x))$ and let $\rho_f^n$ denote the associated average cost; that is,
$\rho_f^n := \bar{V}_n(f,x) = \int_X b_n(y,f(y))\,\mu_f^n(dy)$ for all initial points $x \in X$. Therefore, the value function of
$\overline{\text{MDP}}_n$, denoted by $\bar{V}_n^*$, is given by $\bar{V}_n^*(x) = \inf_{f \in \mathbb{F}} \rho_f^n$, i.e., it is constant on $X$.

Before making the connection with Theorem 2.6, we prove the following result.

Lemma 3.10. The transition probability $p_n$ of c-MDP$_n$ is continuous in $(x,a)$ with respect to
the total variation distance.

Proof. To ease the notation, we define $M(X_n)$, $M(X)$, and $M_w(X)$ as the subsets of $B(X_n)$, $B(X)$,
and $B_w(X)$, respectively, whose elements have (corresponding) norm less than one. Let $(x_k,a_k) \to (x,a)$
in $X_n \times A$. Since the pseudo state $\Delta_n$ is isolated and $K_n$ is compact, we have two cases: (i)
$x_k = x = \Delta_n$ for all $k$ large enough, or (ii) $x_k \to x$ in $K_n$.

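For the first case we have

$$\begin{aligned}
\|p_n(\,\cdot\,|\Delta_n,a_k) - p_n(\,\cdot\,|\Delta_n,a)\|_{TV}
&= \sup_{g \in M(X_n)} \Bigl| \int_{X_n} g(y)\,p_n(dy|\Delta_n,a_k) - \int_{X_n} g(y)\,p_n(dy|\Delta_n,a) \Bigr| \\
&\le \sup_{g \in M(X)} \Bigl| \int_X g(y)\,q_n(dy|\Delta_n,a_k) - \int_X g(y)\,q_n(dy|\Delta_n,a) \Bigr| \qquad (3.30) \\
&= \sup_{g \in M(X)} \Bigl| \int_{K_n^c} \Bigl( \int_X g(y)\,p(dy|z,a_k) - \int_X g(y)\,p(dy|z,a) \Bigr) \nu_n(dz) \Bigr| \\
&\le \int_{K_n^c} \sup_{g \in M(X)} \Bigl| \int_X g(y)\,p(dy|z,a_k) - \int_X g(y)\,p(dy|z,a) \Bigr|\,\nu_n(dz) \\
&\le \int_{K_n^c} \sup_{g \in M_w(X)} \Bigl| \int_X g(y)\,p(dy|z,a_k) - \int_X g(y)\,p(dy|z,a) \Bigr|\,\nu_n(dz) \\
&= \int_{K_n^c} \|p(\,\cdot\,|z,a_k) - p(\,\cdot\,|z,a)\|_w\,\nu_n(dz), \qquad (3.31)
\end{aligned}$$

where (3.30) follows since if for any $g \in M(X_n)$ we define $\bar{g} = g$ on $K_n$ and $\bar{g} = g(\Delta_n)$ on $K_n^c$, then
we have $\bar{g} \in M(X)$ and $\int_{X_n} g(y)\,p_n(dy|x,a) = \int_X \bar{g}(y)\,q_n(dy|x,a)$ for all $(x,a) \in X_n \times A$. Note that we
have

$$\sup_{g \in M_w(X)} \Bigl| \int_X g(y)\,p(dy|z,a_k) - \int_X g(y)\,p(dy|z,a) \Bigr|
\le \int_X w(y)\,p(dy|z,a_k) + \int_X w(y)\,p(dy|z,a)
\le 2(\alpha + \eta(w))\,w(z)$$

by Assumption 3.2-(e), $\phi \le 1$, and $w \ge 1$. Since $w$ (restricted to $K_n^c$) is $\nu_n$-integrable, by the
dominated convergence theorem (3.31) goes to zero as $k \to \infty$.

For the second case we have

$$\begin{aligned}
\|p_n(\,\cdot\,|x_k,a_k) - p_n(\,\cdot\,|x,a)\|_{TV}
&= \sup_{g \in M(X_n)} \Bigl| \int_{X_n} g(y)\,p_n(dy|x_k,a_k) - \int_{X_n} g(y)\,p_n(dy|x,a) \Bigr| \\
&\le \sup_{g \in M(X)} \Bigl| \int_X g(y)\,q_n(dy|x_k,a_k) - \int_X g(y)\,q_n(dy|x,a) \Bigr| \\
&= \sup_{g \in M(X)} \Bigl| \int_X g(y)\,p(dy|x_k,a_k) - \int_X g(y)\,p(dy|x,a) \Bigr| \quad (\text{since } x_k, x \in K_n) \\
&\le \sup_{g \in M_w(X)} \Bigl| \int_X g(y)\,p(dy|x_k,a_k) - \int_X g(y)\,p(dy|x,a) \Bigr| \\
&= \|p(\,\cdot\,|x_k,a_k) - p(\,\cdot\,|x,a)\|_w.
\end{aligned}$$

By Assumption 3.2-(j) the last term goes to zero as $k \to \infty$. □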

Thus we obtain that for each $n \ge 1$, c-MDP$_n$ satisfies the assumption in Theorem 2.6 for

$$\zeta(\,\cdot\,) = \eta(\,\cdot \cap K_n) + \eta(K_n^c)\,\delta_{\Delta_n}(\,\cdot\,),
\qquad
\theta(x,a) = \begin{cases} \phi(x,a), & \text{if } x \in K_n \\ \int_{K_n^c} \phi(y,a)\,\nu_n(dy), & \text{if } x = \Delta_n, \end{cases}$$

and some $\lambda \in (0,1)$, where the existence of $\lambda$ follows from Assumption 3.2-(h) and the fact that
$\phi > 0$.

Consequently, there exists a deterministic stationary policy $f_n \in \mathbb{F}_n$, obtained from the finite
state approximations of c-MDP$_n$, such that

$$\sup_{x \in X_n} |V_n(f_n,x) - V_n^*(x)| \le \varepsilon_n, \tag{3.32}$$

where finite-state models are constructed replacing $(Z, A, p, c)$ with the components $(X_n, A, p_n, c_n)$
of c-MDP$_n$ in Section 2. By Lemma 3.7, we also have

$$|\rho_{f_n}^n - \bar{V}_n^*| \le \varepsilon_n, \tag{3.33}$$

where, by an abuse of notation, we also denote the policy extended to $X$ by $f_n$.

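Lemma 3.11. We have

$$\sup_{f \in \mathbb{F}} |\rho_f^n - \rho_f| \to 0$$

as $n \to \infty$.

Proof. Fix any compact set $K \subset X$. For any $t \ge 1$ and $y \in K$, we have

$$\begin{aligned}
\sup_{f \in \mathbb{F}} |\rho_f^n - \rho_f|
&= \sup_{f \in \mathbb{F}} \Bigl| \int_X b_n(x,f(x))\,\mu_f^n(dx) - \int_X c(x,f(x))\,\mu_f(dx) \Bigr| \\
&\le \sup_{f \in \mathbb{F}} \Bigl| \int_X b_n(x,f(x))\,\mu_f^n(dx) - \int_X b_n(x,f(x))\,q_n^t(dx|y,f(y)) \Bigr| \\
&\quad + \sup_{f \in \mathbb{F}} \Bigl| \int_X b_n(x,f(x))\,q_n^t(dx|y,f(y)) - \int_X c(x,f(x))\,p^t(dx|y,f(y)) \Bigr| \\
&\quad + \sup_{f \in \mathbb{F}} \Bigl| \int_X c(x,f(x))\,p^t(dx|y,f(y)) - \int_X c(x,f(x))\,\mu_f(dx) \Bigr| \\
&\le M R_{\max} w(y) \kappa_{\max}^t + M R\,w(y) \kappa^t \\
&\quad + \sup_{(y,f) \in K \times \mathbb{F}} \Bigl| \int_X b_n(x,f(x))\,q_n^t(dx|y,f(y)) - \int_X c(x,f(x))\,p^t(dx|y,f(y)) \Bigr|,
\end{aligned} \tag{3.34}$$

where the last inequality follows from Theorem 3.3-(ii) and (3.1) in Assumption 3.1. The result
follows from Lemma 3.9. □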


















Saldi, Yüksel, and Linder: Asymptotic Optimality of Finite Approximations to MDPs
Mathematics of Operations Research 00(0), pp. 000—000, ©0000 INFORMS 


Theorem 3.4. The value function of $\overline{\text{MDP}}_n$ converges to the value function of the original
MDP, i.e., $|\bar{V}_n^* - V^*| \to 0$, as $n \to \infty$.

Proof. Since

$$|\bar{V}_n^* - V^*| = \Bigl| \inf_{f \in \mathbb{F}} \rho_f^n - \inf_{f \in \mathbb{F}} \rho_f \Bigr| \le \sup_{f \in \mathbb{F}} |\rho_f^n - \rho_f|,$$

the result follows from Lemma 3.11. □

The following is the main result of this section. It states that the true average cost of the
policies $f_n$ obtained from finite state approximations of c-MDP$_n$ converges to the average value
function $V^*$ of the original MDP.

Theorem 3.5. We have $|\rho_{f_n} - V^*| \to 0$, as $n \to \infty$.

Proof. We have

$$\begin{aligned}
|\rho_{f_n} - V^*| &\le |\rho_{f_n} - \rho_{f_n}^n| + |\rho_{f_n}^n - \bar{V}_n^*| + |\bar{V}_n^* - V^*| \\
&\le \sup_{f \in \mathbb{F}} |\rho_f - \rho_f^n| + \varepsilon_n + |\bar{V}_n^* - V^*| \quad (\text{by (3.33)}).
\end{aligned}$$

The result follows from Lemma 3.11 and Theorem 3.4. □

4. Discretization of the Action Space. For computing near optimal policies using well
known algorithms, such as value iteration, policy iteration, and Q-learning, the action space must
be finite. In this section, we show that, as a pre-processing step, the action space can be taken to
be finite if it has a sufficiently large number of points for accurate approximation. Throughout this
section, it is assumed that Assumption 3.1 holds for the discounted cost and Assumption 3.2 holds
for the average cost.

It was shown in Saldi et al. [33] and Saldi et al. [34] that any MDP with (infinite) compact action
space can be well approximated by an MDP with finite action space under assumptions that are
satisfied by c-MDP$_n$, for both the discounted cost and the average cost cases. Specifically, let $d_A$
denote the metric on $A$. Since $A$ is compact, one can find a sequence of finite subsets $\{\Lambda_k\}$ of $A$
such that for all $k$

$$\min_{\hat{a} \in \Lambda_k} d_A(a,\hat{a}) < 1/k \quad \text{for all } a \in A.$$

We define c-MDP$_{n,k}$ as the Markov decision process having the components $\{X_n, \Lambda_k, p_n, c_n\}$ and we
let $\mathbb{F}_n(\Lambda_k)$ denote the set of all deterministic stationary policies for c-MDP$_{n,k}$. Note that $\mathbb{F}_n(\Lambda_k)$ is
the set of policies in $\mathbb{F}_n$ taking values only in $\Lambda_k$. Therefore, in a sense, c-MDP$_{n,k}$ and c-MDP$_n$ can
be viewed as the same MDP, where the former has constraints on the set of policies. For each $n$ and
$k$, by an abuse of notation, let $f_n^*$ and $f_{n,k}^*$ denote the optimal stationary policies of c-MDP$_n$ and
c-MDP$_{n,k}$, respectively, for both the discounted and average costs. Then Saldi et al. [34, Theorem
3.2] and Saldi et al. [33, Theorem 3.2] show that for all $n$, we have

$$\begin{aligned}
\lim_{k\to\infty} J_n(f_{n,k}^*,x) &= J_n(f_n^*,x) := J_n^*(x), \\
\lim_{k\to\infty} V_n(f_{n,k}^*,x) &= V_n(f_n^*,x) := V_n^*(x)
\end{aligned}$$

for all $x \in X_n$. In other words, the discounted and average value functions of c-MDP$_{n,k}$ converge
to the discounted and average value functions of c-MDP$_n$ as $k \to \infty$. We note that although Saldi
et al. [34, Theorem 3.2] and Saldi et al. [33, Theorem 3.2] are proved for nonnegative one-stage
cost functions, it is straightforward to check that these theorems are also valid for any real valued
one-stage cost function.

Theorem 4.1. For any $x \in X$, there exists a subsequence $\{k_n\}$ such that

$$\begin{aligned}
\lim_{n\to\infty} J(f_{n,k_n}^*,x) &= J^*(x), \\
\lim_{n\to\infty} V(f_{n,k_n}^*,x) &= V^*(x),
\end{aligned}$$

where $f_{n,k_n}^* \in \mathbb{F}_n(\Lambda_{k_n})$ is the optimal stationary policy of c-MDP$_{n,k_n}$.

Proof. Let us fix $x \in X$. For $n$ sufficiently large (so $x \in K_n$), we choose $k_n$ such that
$|J_n(f_{n,k_n}^*,x) - J_n(f_n^*,x)| < 1/n$ (or $|V_n(f_{n,k_n}^*,x) - V_n(f_n^*,x)| < 1/n$ for the average cost). We note that if $A$ is a
compact subset of a finite dimensional Euclidean space, then by using Saldi et al. [33, Theorems
4.1 and 4.2] one can obtain an explicit expression for $k_n$ in terms of $n$ under further continuity
conditions on $c$ and $p$. By Lemmas 3.6 and 3.11, we have $|\bar{J}_n(f_{n,k_n}^*,x) - J(f_{n,k_n}^*,x)| \to 0$ and
$|\bar{V}_n(f_{n,k_n}^*,x) - V(f_{n,k_n}^*,x)| \to 0$ as $n \to \infty$, where again by an abuse of notation, the policies extended
to $X$ are also denoted by $f_{n,k_n}^*$. Since $\bar{J}_n(f_{n,k_n}^*,x) = J_n(f_{n,k_n}^*,x)$ and $\bar{V}_n(f_{n,k_n}^*,x) = V_n(f_{n,k_n}^*,x)$, using
Theorems 3.1 and 3.4 one can immediately obtain

$$\begin{aligned}
\lim_{n\to\infty} J(f_{n,k_n}^*,x) &= J^*(x), \\
\lim_{n\to\infty} V(f_{n,k_n}^*,x) &= V^*(x).
\end{aligned}$$

□

Theorem 4.1 implies that before discretizing the state space to compute the near optimal policies,
one can discretize, without loss of generality, the action space $A$ in advance on a finite grid using
a sufficiently large number of grid points.
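As an illustration of the finite action sets used in this section, the sketch below builds a uniform grid on the hypothetical action space $A = [0,1]$ with the absolute-value metric and measures how far any action can be from the grid. The function names and the choice of $A$ are assumptions made for the example only; the paper requires only that $A$ be a compact metric space.

```python
def action_grid(k):
    """Uniform grid {0, 1/k, ..., 1} on the assumed action space A = [0, 1]."""
    return [j / k for j in range(k + 1)]


def covering_radius(grid, samples=10_001):
    """Largest distance from a sampled action in [0, 1] to its nearest grid point."""
    worst = 0.0
    for s in range(samples):
        a = s / (samples - 1)
        worst = max(worst, min(abs(a - g) for g in grid))
    return worst
```

For this grid the covering radius is at most $1/(2k)$, so the requirement that every action lie within distance $1/k$ of the finite set is met with room to spare.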


5. Rate of Convergence Analysis for Compact-State MDPs. In this section we consider
(Q2) for MDPs with compact state space; that is, we derive an upper bound on the performance
loss due to discretization in terms of the cardinality of the set $Z_n$ (i.e., number of grid points).
To do this, we will impose some new assumptions on the components of the MDP in addition to
Assumptions 2.1 and 2.2. First, we present some definitions that are needed in the development.
For each $g \in C_b(Z)$, let

$$\|g\|_{\mathrm{Lip}} := \sup_{(z,y) \in Z \times Z} \frac{|g(z) - g(y)|}{d_Z(z,y)}.$$

If $\|g\|_{\mathrm{Lip}}$ is finite, then $g$ is called Lipschitz continuous with Lipschitz constant $\|g\|_{\mathrm{Lip}}$. $\mathrm{Lip}(Z)$
denotes the set of all Lipschitz continuous functions on $Z$, i.e.,

$$\mathrm{Lip}(Z) := \{ g \in C_b(Z) : \|g\|_{\mathrm{Lip}} < \infty \}$$

and $\mathrm{Lip}(Z,K)$ denotes the set of all $g \in \mathrm{Lip}(Z)$ with $\|g\|_{\mathrm{Lip}} \le K$. The Wasserstein distance of order
1 (Villani [39, p. 95]) between two probability measures $\zeta$ and $\xi$ over $Z$ is defined as

$$W_1(\zeta,\xi) := \sup \Bigl\{ \Bigl| \int_Z g\,d\zeta - \int_Z g\,d\xi \Bigr| : g \in \mathrm{Lip}(Z,1) \Bigr\}.$$

$W_1$ is also called the Kantorovich-Rubinstein distance. It is known that if $Z$ is compact, then
$W_1(\zeta,\xi) \le \mathrm{diam}(Z)\,\|\zeta - \xi\|_{TV}$; see Villani [39, Theorem 6.15, p. 103]. For compact $Z$, the
Wasserstein distance of order 1 is weaker than total variation distance. Furthermore, for compact $Z$, the
Wasserstein distance of order 1 metrizes the weak topology on the set of probability measures $\mathcal{P}(Z)$
(see Villani [39, Corollary 6.13, p. 97]) which also implies that convergence in this sense is weaker
than setwise convergence.
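The relation between the two distances can be checked numerically in a simple case. The sketch below is an illustration under stated assumptions: the measures are supported on finitely many sorted points of the real line, where the order-1 Wasserstein distance equals the integral of the absolute difference of the two distribution functions, and total variation is taken as half the $\ell_1$ distance (a convention chosen here; other normalizations differ by a factor of two). The function names are hypothetical.

```python
def wasserstein1(points, p, q):
    """W1 between discrete measures p and q on sorted real `points`.

    Uses the one-dimensional identity W1 = integral of |F_p - F_q|.
    """
    total, cdf_gap = 0.0, 0.0
    for i in range(len(points) - 1):
        cdf_gap += p[i] - q[i]
        total += abs(cdf_gap) * (points[i + 1] - points[i])
    return total


def total_variation(p, q):
    """Total variation as half the l1 distance between probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

Two point masses at distance 0.01 are at total variation distance 1 but at Wasserstein distance 0.01, which shows in a small example why convergence in $W_1$ is the weaker notion.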

In this section we impose the following supplementary assumptions in addition to Assumption 2.1
and Assumption 2.2.

Assumption 5.1.

(g) The one-stage cost function $c$ satisfies $c(\,\cdot\,,a) \in \mathrm{Lip}(Z,K_1)$ for all $a \in A$ for some $K_1$.

(h) The stochastic kernel $p$ satisfies $W_1(p(\,\cdot\,|z,a), p(\,\cdot\,|y,a)) \le K_2\,d_Z(z,y)$ for all $a \in A$ for some
$K_2$.

(j) $Z$ is an infinite compact subset of $\mathbb{R}^d$ for some $d \ge 1$, equipped with the Euclidean norm.

We note that Assumption 5.1-(j) implies the existence of a constant $\alpha > 0$ and finite subsets
$Z_n \subset Z$ with cardinality $n$ such that

$$\max_{z \in Z} \min_{y \in Z_n} d_Z(z,y) \le \alpha (1/n)^{1/d} \tag{5.1}$$

for all $n$, where $d_Z$ is the Euclidean distance on $Z$. In the remainder of this section, we replace $Z_n$
defined in Section 2 with $Z_n$ satisfying (5.1) in order to derive explicit bounds on the approximation
error in terms of the cardinality of $Z_n$.
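The scaling in (5.1) can be seen on the unit cube. The sketch below is an illustration under the assumption $Z = [0,1]^d$, using a hypothetical midpoint grid with $m$ points per axis, so that $n = m^d$. For this grid the farthest points of the cube are its corners, at Euclidean distance $\sqrt{d}/(2m) = (\sqrt{d}/2)\,n^{-1/d}$ from the nearest grid point. Handling values of $n$ that are not perfect $d$-th powers would need a somewhat larger constant, which the sketch does not address.

```python
import itertools
import math


def midpoint_grid(m, d):
    """Grid of m**d cell midpoints in the assumed state space [0, 1]**d."""
    axis = [(2 * j + 1) / (2 * m) for j in range(m)]
    return list(itertools.product(axis, repeat=d))


def distance_to_grid(point, grid):
    """Euclidean distance from `point` to its nearest grid point."""
    return min(math.dist(point, g) for g in grid)


def corner_distance_formula(m, d):
    """Closed form sqrt(d) / (2m) for the distance from a corner to the grid."""
    return math.sqrt(d) / (2 * m)
```

With `n = m ** d`, the value returned by `corner_distance_formula(m, d)` equals `(math.sqrt(d) / 2) * n ** (-1 / d)`, matching the $(1/n)^{1/d}$ rate.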


5.1. Discounted Cost. Assumptions 2.1 and 5.1 are imposed throughout this section.
Additionally, we assume that $K_2 \beta < 1$. The last assumption is the key to proving the next result, which
states that the value function $J^*$ of the original MDP for the discounted cost is in $\mathrm{Lip}(Z)$. Although
this result is known in the literature (see Hinderer [24]), we give a short proof for the sake of
completeness using a simple application of the value iteration algorithm.

Theorem 5.1. The value function $J^*$ for the discounted cost is in $\mathrm{Lip}(Z,K)$, where
$K = K_1 \frac{1}{1 - \beta K_2}$.

Proof. Let $u \in \mathrm{Lip}(Z,K)$ for some $K > 0$. Then $g = \frac{u}{K} \in \mathrm{Lip}(Z,1)$ and therefore, for all $a \in A$ and
$z, y \in Z$ we have

$$\begin{aligned}
\Bigl| \int_Z u(x)\,p(dx|z,a) - \int_Z u(x)\,p(dx|y,a) \Bigr|
&= K \Bigl| \int_Z g(x)\,p(dx|z,a) - \int_Z g(x)\,p(dx|y,a) \Bigr| \\
&\le K\,W_1(p(\,\cdot\,|z,a), p(\,\cdot\,|y,a)) \le K K_2\,d_Z(z,y)
\end{aligned}$$

by Assumption 5.1-(h). Hence, the contraction operator $T$ defined in (2.2) maps $u \in \mathrm{Lip}(Z,K)$ to
$Tu \in \mathrm{Lip}(Z, K_1 + \beta K K_2)$, since, for all $z, y \in Z$

$$\begin{aligned}
|Tu(z) - Tu(y)| &\le \max_{a \in A} \Bigl\{ |c(z,a) - c(y,a)| + \beta \Bigl| \int_Z u(x)\,p(dx|z,a) - \int_Z u(x)\,p(dx|y,a) \Bigr| \Bigr\} \\
&\le K_1 d_Z(z,y) + \beta K K_2\,d_Z(z,y) = (K_1 + \beta K K_2)\,d_Z(z,y).
\end{aligned}$$

Now we apply $T$ recursively to obtain the sequence $\{T^n u\}$ by letting $T^n u = T(T^{n-1} u)$, which
converges to the value function $J^*$ by the Banach fixed point theorem. Clearly, by induction we
have for all $n \ge 1$

$$T^n u \in \mathrm{Lip}(Z,K_n),$$

where $K_n = K_1 \sum_{i=0}^{n-1} (\beta K_2)^i + K (\beta K_2)^n$. If we choose $K < K_1$, then $K_n \le K_{n+1}$ for all $n$ and
therefore, $K_n \uparrow K_1 \frac{1}{1 - \beta K_2}$ since $K_2 \beta < 1$. Hence, $T^n u \in \mathrm{Lip}(Z, K_1 \frac{1}{1 - \beta K_2})$ for all $n$, and therefore,
$J^* \in \mathrm{Lip}(Z, K_1 \frac{1}{1 - \beta K_2})$ since $\mathrm{Lip}(Z, K_1 \frac{1}{1 - \beta K_2})$ is closed with respect to the sup-norm $\|\cdot\|$. □
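The Lipschitz constants in the proof above obey the one-step recursion in which an application of the Bellman operator turns a constant $K$ into $K_1 + \beta K_2 K$. The following sketch iterates that recursion for hypothetical parameter values (all numbers are illustrative assumptions) and can be used to see the monotone convergence to $K_1/(1 - \beta K_2)$ when $\beta K_2 < 1$ and the starting constant is below $K_1$.

```python
def lipschitz_constants(K1, K2, beta, K0, steps):
    """Iterate K -> K1 + beta * K2 * K starting from K0.

    Returns the list [K0, K after 1 step, ..., K after `steps` steps].
    """
    ks = [K0]
    for _ in range(steps):
        ks.append(K1 + beta * K2 * ks[-1])
    return ks


def limit_constant(K1, K2, beta):
    """Fixed point K1 / (1 - beta * K2); requires beta * K2 < 1."""
    return K1 / (1.0 - beta * K2)
```

For example, with `K1 = 1.0`, `K2 = 1.5`, `beta = 0.5`, and `K0 = 0.5`, the contraction factor is 0.75 and the fixed point is 4.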

The following theorem is the main result of this section. Recall that the policy $f_n \in \mathbb{F}$ is obtained
by extending the optimal policy $f_n^*$ of MDP$_n$ to $Z$.

Theorem 5.2. We have 


I Wn 


- J*\\ < 


r(P,K 2 )K 1T ^ 


i-Pk 2 


1-/3 


2Ki 


where r(/3, K 2 ) = (2 + (5)f5K 2 + ^ and a is the coefficient in (5.1 ). 
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Proof. To prove the theorem, we obtain upper bounds on the expressions derived in Section 2.1 
in terms of the cardinality n of Z„. The proof of Theorem 2.2 gives 


iw»,-)-*n< 


\\T fn J*-ff n J*\\ + (l + P)\\J*-J* 
1-/3 


To prove the theorem we upper bound \\Tf n J* — Tj n J* || and || J* — J*|| in terms n. For the first 
term we have 


\W -V\\ (z) - t,j-(z)\ 


zGZ 


< sup 

z £Z 


c(z,f n (z)) + /3 / J*(y)p(dy\z, f n (z)) - c(x, f n (x)) - ft / J*(y)p(dy\x,f n (x)) 


z GZ 


< sup / \K 1 dz(x,z) + P 


[j*{y)p{dy\z, f n (z )) -Jj*(y)p(dy\x, f n (z )) 


z,in(z) 


(dx) 


Vn,i n (z ) ( dx ) 


(since f n (x) = f n (z) for all x <E *S„ >in(z) ) 


— sup / (K\ + /3|| J*Hu P ^ 2 )dz(x,z)iy n:in(z) (dx) 
z GZ J 

<(K 1 + f3\\J*\\ Li pK 2 ) max diam(«S nii ) 

iG{l,...,n} 

<(iF 1 + /3||J*|| Lip iF 2 )2a(l/r i ) 1 / d . 

For the second term, the proof of Theorem 2.4 gives 

\T n J* — F n J*\\ + (1 + /3)|| J* — u* n \ 


(5.2) 


First consider \\T n J* — F n J*\\. Dehne 


1-/3 


so that 


l(z,a) :=c(z,a) + l 3 / J*(y)p(dy\z,a), 
Jx 


J*(z ) = min/(;<:, a). 

aG A 


It is straightforward to show that l(-, a) € Lip(Z, Kf) for all a £ A, where K l = K 1 + /3|| J*|| L ipi^ 2 - 
By adapting the proof of Lemma 2.3 to the value function J*, we obtain 


\\T n J* - F n J*\\ = sup 


z GZ 


mm 

aGA 


J l(x, 


a )v n ,in(z) (dx) - / min l(x,a)u n ^ z) (dx) 


aGA 


— sup / sup |/(y,ai) - J*(//)|zz„ jiTl(z )(dy) 

zGZ J yeS„ tin(z) 

nax / sup (|Z(y,a i )-Z(^ i ,a i )| + |J*( 2 r i )-J*(y)|}i/ n>i (dy) 

yGS„,i 

nax / sup {iFjd z (y,^i) +||J*|| L ipdz(^,y)}^n,i(dy) 


< max 

iG{ 


< max 

iG{ 

< (Kl + || J*|| Lip ) max diam(«S n>i ) 
<(/F ; + ||J*|| Lip )2a(l/n) 1/d . 

For the expression || J* — u* n ||. by Lemma 2.2 we have 


(5.3) 
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where <L r (z) = T,i" 1 r i ls ni (z), r = (ri,... ,r kn ). Since || J*|| L i P < oo, we have inf reZ k„ || J* - <L r || < 
||J*|| Lip max ig{1: ... j7l} diam(5 n i ) < || J*\\ Up 2a(l/n) 1/d . Hence 

IK - J *\ I < Y ^\\ r \\ Uv 2 a ( l / n ) 1/d . (5.4) 

Hence, by (5.3) and (5.4) we obtain 

IK ~r II < ((^2 + IKIhp + 2a(l/n)K (5.5) 

Then, the result follows from (5.2) and (5.5), and the fact $\|J^*\|_{\mathrm{Lip}} \le K_1 \frac{1}{1 - \beta K_2}$ (Theorem 5.1).

Remark 5.1. It is important to point out that if we replace Assumption 5.1-(h) with the uniform
Lipschitz continuity of $p(\,\cdot\,|z,a)$ in $z$ with respect to total variation distance, then Theorem 5.2
remains valid (with possibly different constants in front of the term $(1/n)^{1/d}$). However, in this
case, we do not need the assumption $K_2 \beta < 1$.

Remark 5.2. For the average cost case, instead of assuming from the outset the uniform
Lipschitz continuity of $c$ and $p$ in the $z$ variable, we first derive a rate of convergence result in terms
of the moduli of continuity $\omega_c$ and $\omega_p$ in the $z$ variable of $c(z,a)$ and $p(\,\cdot\,|z,a)$, where
the total variation distance is used to define $\omega_p$. Then, we state that an explicit rate of convergence
result can be given if we impose some structural assumptions on $\omega_c$ and $\omega_p$ such as linearity, which
corresponds to the uniform Lipschitz continuity of $c(z,a)$ and $p(\,\cdot\,|z,a)$ in $z$. However, this is not the
right approach for the discounted cost case as the modulus of continuity function $\omega_p$ is calculated
using the Wasserstein distance of order 1. Indeed, to obtain a similar result as in the average cost
case, we must relate $\omega_c$ and $\omega_p$ to the modulus of continuity $\omega_{J^*}$ of the value function $J^*$. This can
be established if $\omega_c$ and $\omega_p$ are affine functions (i.e., $\omega_c(r) = K_1 r + L_1$ and $\omega_p(r) = K_2 r + L_2$) using
the dual formulation of the Wasserstein distance of order 1 [39, Theorem 5.10]:

$$W_1(\mu,\nu) = \sup_{\substack{(\psi,\varphi) \in C_b(Z) \times C_b(Z) \\ \psi(x) - \varphi(y) \le d_Z(x,y)}} \Bigl( \int_Z \psi(z)\,\mu(dz) - \int_Z \varphi(z)\,\nu(dz) \Bigr).$$

However, in this situation we can explicitly compute the convergence rate only if $L_1 = L_2 = 0$ which
is the uniform Lipschitz continuity case.


5.2. Average Cost. In this section, we suppose that Assumptions 2.2 and 5.1-(j) hold. We
define the modulus of continuity functions in the $z$ variable of $c(z,a)$ and $p(\,\cdot\,|z,a)$ as follows:

$$\begin{aligned}
\omega_c(r) &:= \sup_{a \in A} \; \sup_{z,y \in Z :\, d_Z(z,y) \le r} |c(z,a) - c(y,a)|, \\
\omega_p(r) &:= \sup_{a \in A} \; \sup_{z,y \in Z :\, d_Z(z,y) \le r} \|p(\,\cdot\,|z,a) - p(\,\cdot\,|y,a)\|_{TV}.
\end{aligned}$$

Since $c(z,a)$ and $p(\,\cdot\,|z,a)$ are uniformly continuous, we have $\lim_{r \to 0} \omega_c(r) = 0$ and $\lim_{r \to 0} \omega_p(r) = 0$.
Note that when $\omega_c$ and $\omega_p$ are linear, $c(z,a)$ and $p(\,\cdot\,|z,a)$ are uniformly Lipschitz in $z$. In the
remainder of this section, we first derive a rate of convergence result in terms of $\omega_c$ and $\omega_p$. Then,
we explicitly compute the convergence rate for the Lipschitz case as a corollary of this result.

To obtain convergence rates for the average cost, we first prove a rate of convergence result for
Lemma 2.6. To this end, for each $n \ge 1$, let $d_n := 2\alpha (1/n)^{1/d}$, where $\alpha$ is the coefficient in (5.1).

Lemma 5.1. For all $t \ge 1$, we have

$$\sup_{(y,f) \in Z \times \mathbb{F}} \|p^t(\,\cdot\,|y,f(y)) - q_n^t(\,\cdot\,|y,f(y))\|_{TV} \le t\,\omega_p(d_n).$$

Proof. Similar to the proof of Lemma 2.6, we use induction. For $t = 1$, recalling the proof of
Lemma 2.6, the claim holds by the following argument:

$$\sup_{(y,f) \in Z \times \mathbb{F}} \|p(\,\cdot\,|y,f(y)) - q_n(\,\cdot\,|y,f(y))\|_{TV}
\le \sup_{y \in Z} \; \sup_{(x,a) \in S_{n,i_n(y)} \times A} \|p(\,\cdot\,|y,a) - p(\,\cdot\,|x,a)\|_{TV}
\le \omega_p(d_n).$$

Now, assume the claim is true for $t \ge 1$. Again recalling the proof of Lemma 2.6, we have

$$\begin{aligned}
\sup_{(y,f) \in Z \times \mathbb{F}} \|p^{t+1}(\,\cdot\,|y,f(y)) - q_n^{t+1}(\,\cdot\,|y,f(y))\|_{TV}
&\le \sup_{(y,f) \in Z \times \mathbb{F}} \|p^t(\,\cdot\,|y,f(y)) - q_n^t(\,\cdot\,|y,f(y))\|_{TV} \\
&\quad + \sup_{(z,f) \in Z \times \mathbb{F}} \|p(\,\cdot\,|z,f(z)) - q_n(\,\cdot\,|z,f(z))\|_{TV} \\
&\le t\,\omega_p(d_n) + \omega_p(d_n) = (t+1)\,\omega_p(d_n).
\end{aligned}$$

This completes the proof. □
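The linear growth in $t$ of the bound in Lemma 5.1 has a simple finite-state analogue: for two stochastic matrices, the largest row-wise $\ell_1$ distance between their $t$-th powers is at most $t$ times the one-step distance. The sketch below checks this on two hypothetical $3 \times 3$ matrices (the numbers and function names are assumptions for illustration; the row-wise $\ell_1$ distance plays the role of the total variation norm defined as a supremum over functions bounded by one).

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def row_distance(A, B):
    """Largest l1 distance between corresponding rows of A and B."""
    return max(sum(abs(a - b) for a, b in zip(ra, rb)) for ra, rb in zip(A, B))


# Two hypothetical transition matrices that differ slightly.
P = [[0.70, 0.20, 0.10], [0.10, 0.80, 0.10], [0.20, 0.20, 0.60]]
Q = [[0.65, 0.25, 0.10], [0.10, 0.75, 0.15], [0.20, 0.25, 0.55]]


def power_gaps(P, Q, horizon):
    """Return [(t, row distance between P**t and Q**t)] for t = 1..horizon."""
    Pt, Qt, out = P, Q, []
    for t in range(1, horizon + 1):
        out.append((t, row_distance(Pt, Qt)))
        Pt, Qt = matmul(Pt, P), matmul(Qt, Q)
    return out
```

Each entry `(t, gap)` of `power_gaps(P, Q, 8)` should satisfy `gap <= t * row_distance(P, Q)`, mirroring the induction step of the proof.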

The following theorem is the main result of this section. A somewhat similar result was obtained
in Hernandez-Lerma [20, Section 3.5], where identical assumptions are imposed on both the original
model and the approximating model (see Hernandez-Lerma [20, Assumption 5.1]). Moreover, the
approximating transition probability and one-stage cost function are assumed to converge to the
original transition probability and one-stage cost function with respect to some rate; that is,
$\rho(n) := \sup_{(x,a) \in X \times A} |b_n(x,a) - c(x,a)|$ and $\pi(n) := \sup_{(x,a) \in X \times A} \|q_n(\,\cdot\,|x,a) - p(\,\cdot\,|x,a)\|_{TV}$ with $\rho(n), \pi(n) \to 0$
as $n \to \infty$. Although our result may appear to be a special case of the results in Hernandez-Lerma
[20, Section 3.5], there are several differences: (i) our assumptions are only imposed for the
original model, and (ii) in Hernandez-Lerma [20, Section 3.5] the approximating models do not have
finite state space while our approximating models are obtained by extending finite state models
to the original state space, thereby allowing for a constructive numerical method to calculate near
optimal policies.

Recall that the optimal policy $\hat{f}_n^*$ for $\overline{\text{MDP}}_n$ is obtained by extending the optimal policy $f_n^*$ for
MDP$_n$ to $Z$, and $R$ and $\kappa$ are the constants in Theorem 2.5.

Theorem 5.3. For all $t \ge 1$, we have

$$|\rho_{\hat{f}_n^*} - \rho_{f^*}| \le 4\|c\| R \kappa^t + 2\omega_c(d_n) + 2\|c\|\,t\,\omega_p(d_n).$$

Proof. The proof of Theorem 2.6 gives 


I Pf* ~Pf*\< I Pf* ~ P}* I + \Pf* ~ P}* I + I P}* ~ Pf* I • 

J n J n J n J n J n J n 


Hence, to prove the theorem we obtain an upper bounds on the three terms in the sum. Consider 
the first term (recall the proof of Lemma 2.7) 


\Pf* -P/*l < s up|p/-p/| 

n Jn j e¥ 

< 2R«: t ||c|| + ||c|| sup \\q t n {-\y,f{y))-p t {-\yJ{y))\\Tv 

(yJ)GZxF 

< 2i2/e t ||c|| + ||c||tWp(d n ) (by Lemma 5.1). (5-6) 

For the second term, the proof of Lemma 2.11 gives 


IP/* - P/* I < IP;, ~P}*\ + \p), - P)* I 

J n Jn Jn Jn Jn Jn 

<sup|^-^| + |inf^-inf^| 

< 2 sup \pf — Pf\ 

/€ F 

< 2||6„ — c|| (see the proof of Lemma 2.9) 


< 2 sup 

( 2 ,a)GZxA ^ 


\c{x,a) - c(z,a)\v ntin ( z )(dx) 


(5.7) 
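For the last term, we have 

$$\begin{aligned}
|\tilde\rho^n_{\tilde f_n^*} - \rho_{f^*}| = \Big|\inf_{f\in\mathbb{F}} \tilde\rho^n_f - \inf_{f\in\mathbb{F}} \rho_f\Big| &\le \sup_{f\in\mathbb{F}} |\tilde\rho^n_f - \rho_f| \\
&\le 2R\kappa^t\|c\| + \|c\|\, t\, \omega_p(d_n) \quad \text{(by (5.6))}. \qquad (5.8)
\end{aligned}$$

Combining (5.6), (5.7), and (5.8) implies the result. □ 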

To explicitly calculate a convergence rate, we need to impose some structural assumptions on $\omega_c$ 
and $\omega_p$. One such assumption is linearity, which corresponds to the uniform Lipschitz continuity of 
$c(z,a)$ and $p(\cdot\,|\,z,a)$ in $z$. This means that $\omega_c(r) = K_1 r$ and $\omega_p(r) = K_2 r$, or equivalently, 
$|c(z,a) - c(y,a)| \le K_1 d_Z(z,y)$ and $\|p(\cdot\,|\,z,a) - p(\cdot\,|\,y,a)\|_{TV} \le K_2 d_Z(z,y)$ for all $z,y \in Z$ and $a \in A$. In this case, 
by Theorem 5.3, for all $t \ge 1$ we have 

$$|\rho_{\hat f_n^*} - \rho_{f^*}| \le 4\|c\| R \kappa^t + 4K_1\alpha(1/n)^{1/d} + 4\|c\| K_2 \alpha (1/n)^{1/d}\, t. \qquad (5.9)$$

To obtain a proper rate of convergence result (i.e., an upper bound that depends only on $n$), the 
dependence of the upper bound on $t$ has to be written as a function of $n$. This can be done by 
(approximately) minimizing the upper bound in (5.9) with respect to $t$ for each $n$. Let us define 
the constants $I_1 := 4\|c\|R$, $I_2 := 4K_1\alpha$, and $I_3 := 4\|c\|K_2\alpha$. Then the upper bound in (5.9) becomes 

$$I_1\kappa^t + I_2(1/n)^{1/d} + I_3(1/n)^{1/d}\, t. \qquad (5.10)$$


For each $n$, it is straightforward to compute that 

$$t'(n) := \ln\!\left(\frac{n^{1/d}}{I_4}\right) \frac{1}{\ln(1/\kappa)}$$

is the zero of the derivative of the convex term in (5.10), where $I_4 := \frac{I_3}{I_1 \ln(1/\kappa)}$. Letting $t = t'(n)$ in 
(5.10), we obtain the following result. 

Corollary 5.1. Suppose that $c(z,a)$ and $p(\cdot\,|\,z,a)$ are uniformly Lipschitz continuous in $z$ in 
addition to the assumptions imposed at the beginning of this section. Then, we have 

$$|\rho_{\hat f_n^*} - \rho_{f^*}| \le (I_1 I_4 + I_2)(1/n)^{1/d} + \frac{I_3}{\ln(1/\kappa)}\,(1/n)^{1/d} \ln\!\left(\frac{n^{1/d}}{I_4}\right).$$
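
The choice of $t$ behind Corollary 5.1 can be checked numerically. The following Python sketch evaluates the upper bound in (5.10), the stationary point $t'(n)$, and the resulting closed-form bound. The numerical values of $d$, $\kappa$, $I_1$, $I_2$, $I_3$ used below are hypothetical placeholders chosen only for illustration; they are not constants derived in this paper.

```python
import math


def bound(t, n, d, kappa, I1, I2, I3):
    """Upper bound (5.10): I1*kappa^t + I2*(1/n)^(1/d) + I3*(1/n)^(1/d)*t."""
    r = n ** (-1.0 / d)
    return I1 * kappa ** t + I2 * r + I3 * r * t


def t_prime(n, d, kappa, I1, I3):
    """Zero of the derivative (in t) of the convex part of (5.10)."""
    I4 = I3 / (I1 * math.log(1.0 / kappa))
    return math.log(n ** (1.0 / d) / I4) / math.log(1.0 / kappa)


def corollary_bound(n, d, kappa, I1, I2, I3):
    """Closed form obtained by substituting t = t'(n) into (5.10)."""
    I4 = I3 / (I1 * math.log(1.0 / kappa))
    r = n ** (-1.0 / d)
    return (I1 * I4 + I2) * r + I3 / math.log(1.0 / kappa) * r * math.log(n ** (1.0 / d) / I4)


# Hypothetical constants (assumptions for illustration only).
d, kappa, I1, I2, I3 = 2, 0.5, 4.0, 1.0, 2.0

for n in (10**2, 10**4, 10**6):
    t = t_prime(n, d, kappa, I1, I3)
    print(n, round(t, 3), bound(t, n, d, kappa, I1, I2, I3),
          corollary_bound(n, d, kappa, I1, I2, I3))
```

Under these assumed constants, the last two printed columns coincide, and the bound decays like $(1/n)^{1/d}\ln n$ as $n$ grows.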


6. Order Optimality for Approximation Errors in the Rate of Quantization. The 
following example demonstrates that the order of the performance losses in Theorem 5.2 and 
Corollary 5.1 cannot be better than $O((1/n)^{1/d})$. More precisely, we exhibit a simple standard example 
where we can lower bound the performance loss by $L(1/n)^{1/d}$, for some positive constant $L$. A 
similar result was obtained in Saldi et al. [33, Section IV] for the case of quantization of the action 
space, where the action space was a compact subset of $\mathbb{R}^m$ for some $m \ge 1$. Therefore, when both 
state and action spaces are quantized, the resulting construction is order optimal in the above 
sense, as the approximation error in this case is bounded by the sum of the approximation errors 
in the quantization of the state space and the quantization of the action space. 

In what follows $h(\cdot)$ and $h(\cdot\,|\,\cdot)$ denote differential and conditional differential entropies, 
respectively; see Cover and Thomas [12, Chapter 8]. 

Consider the additive-noise system: 

$$z_{t+1} = F(z_t,a_t) + v_t, \quad t = 0,1,2,\ldots,$$

where $z_t,a_t,v_t \in \mathbb{R}^d$. We assume that $\sup_{(z,a)\in\mathbb{R}^d\times\mathbb{R}^d} \|F(z,a)\| < 1/2$. The noise process $\{v_t\}$ is a 
sequence of i.i.d. random vectors whose common distribution has density $g$ supported on some 
compact subset $V$ of $\mathbb{R}^d$. We choose $V$ such that $Z = A$ can be taken to be a compact subset of 
$\mathbb{R}^d$. For simplicity suppose that the initial distribution $\mu$ has the same density $g$. It is assumed 
that the differential entropy $h(g) := -\int_Z g(z)\log g(z)\,dz$ is finite. Let the one-stage cost function be 
$c(z,a) := \|z - a\|$. Clearly, the optimal stationary policy $f^*$ is induced by the identity $f^*(z) = z$, 
having the optimal cost $J(f^*,\mu) = 0$ and $V(f^*,\mu) = 0$. Let $\hat f_n$ be the piecewise constant extension 
of the optimal policy $f_n^*$ of MDP$_n$ to the set $Z$. Fix $n \ge 1$ and define $D_t := E_\mu^{\hat f_n}[c(z_t,a_t)]$ for all 
$t$. Then, since $a_t = \hat f_n(z_t)$ can take at most $n$ values in $A$, by the Shannon lower bound (SLB) (see 
Yamada et al. [44, p. 12]) we have for $t \ge 1$ 

$$\begin{aligned}
\log n \ge R(D_t) &\ge h(z_t) + \theta(D_t) \\
&= h(F(z_{t-1},a_{t-1}) + v_{t-1}) + \theta(D_t) \\
&\ge h(F(z_{t-1},a_{t-1}) + v_{t-1} \mid z_{t-1},a_{t-1}) + \theta(D_t) \qquad (6.1) \\
&= h(v_{t-1}) + \theta(D_t), \qquad (6.2)
\end{aligned}$$

where $\theta(D_t) = -d + \log\!\left(\frac{1}{dV_d\Gamma(d)}\left(\frac{d}{D_t}\right)^d\right)$, $R(D_t)$ is the rate-distortion function of $z_t$, $V_d$ is the volume 
of the unit sphere $S_d = \{z : \|z\| \le 1\}$, and $\Gamma$ is the gamma function. Here, (6.1) follows from the fact 
that conditioning reduces the entropy (see Cover and Thomas [12, Theorem 2.6.5, p. 29]) and (6.2) 
follows from the independence of $v_{t-1}$ and the pair $(z_{t-1},a_{t-1})$. Note that $h(v_{t-1}) = h(g)$ for all $t$. 
Thus, $D_t \ge L(1/n)^{1/d}$, where $L := \frac{d}{2}\left(\frac{2^{h(g)}}{dV_d\Gamma(d)}\right)^{1/d}$. Since we have obtained stage-wise error bounds, 
these give $|J(f^*,\mu) - J(\hat f_n,\mu)| \ge \frac{L}{1-\beta}(1/n)^{1/d}$ and $|V(f^*,\mu) - V(\hat f_n,\mu)| \ge L(1/n)^{1/d}$. 
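
The $(1/n)^{1/d}$ order can be observed directly in the simplest case. The Python sketch below is an illustration under assumptions that are not part of the example above: $d = 1$, the state is taken to be uniformly distributed on $[0,1]$, and the policy maps each of $n$ equal cells to its midpoint. It computes the mean stage cost $E|z - a|$ of such an $n$-valued policy; it illustrates the order of the loss only and does not evaluate the constant $L$.

```python
import numpy as np


def mean_abs_quantization_error(n, samples=200_000):
    """Mean of |z - q_n(z)| for z uniform on [0, 1] (assumed distribution),
    where q_n maps each of n equal cells to its midpoint."""
    # Deterministic fine grid on [0, 1] used as a Riemann-sum approximation.
    z = (np.arange(samples) + 0.5) / samples
    cell = np.minimum((z * n).astype(int), n - 1)   # index of the cell containing z
    a = (cell + 0.5) / n                            # n-valued action: cell midpoint
    return float(np.mean(np.abs(z - a)))


for n in (4, 16, 64, 256):
    D = mean_abs_quantization_error(n)
    print(n, D, n * D)   # n * D stays essentially constant, i.e. D is of order 1/n
```

In this assumed setting the stage cost of an $n$-valued policy decays proportionally to $1/n$, which matches the order $(1/n)^{1/d}$ with $d = 1$.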

Remark 6.1. We note that if $h(x_{t+1}\,|\,x_t,a_t)$ can be lower bounded by some constant $k$ for 
all $t \ge 1$, the above analysis still holds with $h(g)$ replaced by $k$. For instance, this is the case if the 
transition probability $p(\cdot\,|\,x,a)$ admits a density which is bounded from above uniformly in $(x,a)$. 

7. Numerical Examples. In this section, we consider two examples, the additive noise model 
and the fisheries management problem, in order to illustrate our results numerically. Since computing 
the true costs of the policies obtained from the finite models is intractable, we only compute the value 
functions of the finite models and illustrate their convergence to the value function of the original 
MDP as $n \to \infty$. 

Before proceeding to the examples, we note that all results in this paper apply with straightforward 
modifications to the case of maximizing reward instead of minimizing cost. 

7.1. Additive Noise System. In this example, the additive noise system is given by 

$$x_{t+1} = F(x_t,a_t) + v_t, \quad t = 0,1,2,\ldots$$

where $x_t,a_t,v_t \in \mathbb{R}$ and $X = \mathbb{R}$. The noise process $\{v_t\}$ is a sequence of $\mathbb{R}$-valued i.i.d. random 
variables with common density $g$. Hence, the transition probability $p(\cdot\,|\,x,a)$ is given by 

$$p(D\,|\,x,a) = \int_D g(v - F(x,a))\, m(dv) \quad \text{for all } D \in \mathcal{B}(\mathbb{R}),$$

where $m$ is the Lebesgue measure. The one-stage cost function is $c(x,a) = (x - a)^2$, the action space 
is $A = [-L,L]$ for some $L > 0$, and the cost function to be minimized is the discounted cost. 

We assume that (i) $g$ is a Gaussian probability density function with zero mean and variance $\sigma^2$, 
(ii) $\sup_{a\in A} |F(x,a)|^2 \le k_1 x^2 + k_2$ for some $k_1,k_2 \in \mathbb{R}_+$, (iii) $\beta < 1/\alpha$ for some $\alpha \ge k_1$, and (iv) $F$ is 
continuous. Hence, Assumption 3.1 holds for this model with w(x) = k + x 2 and M = 4(-k-+ x 2 ), 
for some $k \in \mathbb{R}_+$. 

For the numerical results, we use the following parameters: $F(x,a) = x + a$, $\beta = 0.3$, $L = 0.5$, and 
$\sigma = 0.1$. 

We selected a sequence $\{[-l_n,l_n]\}_{n=1}^{15}$ of nested closed intervals, where $l_n = 0.5 + 0.25n$, to 
approximate $\mathbb{R}$. Each interval is uniformly discretized using $\lceil 2k_{\lceil n/3\rceil} l_n \rceil$ grid points, where $k_m = 5m$ 
for $m = 1,\ldots,5$ and $\lceil q \rceil$ denotes the smallest integer greater than or equal to $q \in \mathbb{R}$. Therefore, the 
discretization is gradually refined. For each $n$, the finite state space is given by $\{x_{n,i}\}_i \cup \{\Delta_n\}$, 
where $\{x_{n,i}\}_i$ are the representation points in the uniform quantization of the closed interval 
$[-l_n,l_n]$ and $\Delta_n$ is a pseudo state. We also uniformly discretize the action space $A = [-0.5,0.5]$ 
by using $2k_{\lceil n/3\rceil}$ grid points. For each $n$, the finite state models are constructed as in Section 2 
by replacing $Z$ with $[-l_n,l_n]$ and by setting $\nu_n(\cdot) = \frac{1}{2} m_n(\cdot) + \frac{1}{2}\delta_{\Delta_n}(\cdot)$, where $m_n$ is the Lebesgue 
measure normalized over $[-l_n,l_n]$. 

We use the value iteration algorithm to compute the value functions of the finite models. Figure 1 
displays the graph of these value functions corresponding to the different values for the number of 
grid points, when the initial state is x = 0.7. The figure illustrates that the value functions of the 
finite models converge to the value function of the original model. 
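
The computation described above can be sketched in Python as follows. This is a simplified illustration, not the exact construction of Section 2: the cost and transition probability of each cell are evaluated at its representation point rather than averaged with respect to $\nu_n$, and the pseudo state is treated as absorbing with zero cost. The function names, the interval half-length `l`, and the grid sizes are hypothetical choices; $F(x,a) = x + a$, $\beta = 0.3$, $L = 0.5$, and $\sigma = 0.1$ are the parameters given in the text.

```python
import math
import numpy as np

BETA, L_ACT, SIGMA = 0.3, 0.5, 0.1  # parameters quoted in the text


def gaussian_cdf(points, sigma):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return np.array([0.5 * (1.0 + math.erf(p / (sigma * math.sqrt(2.0))))
                     for p in points])


def build_finite_model(l, n_states, n_actions):
    """Discretize [-l, l] into n_states cells plus one pseudo state."""
    edges = np.linspace(-l, l, n_states + 1)
    states = 0.5 * (edges[:-1] + edges[1:])           # representation points
    actions = np.linspace(-L_ACT, L_ACT, n_actions)
    pseudo = n_states                                 # index of the pseudo state
    P = np.zeros((n_states + 1, n_actions, n_states + 1))
    C = np.zeros((n_states + 1, n_actions))
    for i, x in enumerate(states):
        for j, a in enumerate(actions):
            cdf = gaussian_cdf(edges - (x + a), SIGMA)   # F(x, a) = x + a
            P[i, j, :n_states] = np.diff(cdf)            # mass of each cell
            P[i, j, pseudo] = max(0.0, 1.0 - (cdf[-1] - cdf[0]))  # mass outside [-l, l]
            C[i, j] = (x - a) ** 2
    P[pseudo, :, pseudo] = 1.0  # simplification: absorbing, zero-cost pseudo state
    return states, actions, P, C


def value_iteration(P, C, beta=BETA, tol=1e-10, max_iter=10_000):
    """Iterate V <- min_a [C + beta * P V] until the sup-norm change is below tol."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        Q = C + beta * (P @ V)
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmin(axis=1)


states, actions, P, C = build_finite_model(l=1.0, n_states=20, n_actions=11)
V, policy = value_iteration(P, C)
i = int(np.argmin(np.abs(states - 0.7)))
print(f"approximate value at x = {states[i]:.3f}: {V[i]:.6f}")
```

Refining the grid (larger `n_states`, `n_actions`, and `l`) and re-running the iteration gives a sequence of value functions of the kind compared in this example.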



Figure 1. Optimal costs of the finite models when the initial state is x = 0.7 


7.2. Fisheries Management Problem. In this example we consider the following population 
growth model, called a Ricker model; see Hernandez-Lerma and Lasserre [21, Section 1.3]: 

$$x_{t+1} = \theta_1 a_t \exp\{-\theta_2 a_t + v_t\}, \quad t = 0,1,2,\ldots \qquad (7.1)$$

where $\theta_1,\theta_2 \in \mathbb{R}_+$, $x_t$ is the population size in season $t$, and $a_t$ is the population to be left for 
spawning for the next season, or in other words, $x_t - a_t$ is the amount of fish captured in season 
$t$. The one-stage ‘reward’ function is $u(x_t - a_t)$, where $u$ is some utility function. In this model, the 
goal is to maximize the average reward. 

The state and action spaces are $X = A = [\kappa_{\min},\kappa_{\max}]$, for some $\kappa_{\min},\kappa_{\max} \in \mathbb{R}_+$. Since the 
population left for spawning cannot be greater than the total population, for each $x \in X$, the set of 
admissible actions is $A(x) = [\kappa_{\min},x]$, which is not consistent with our assumptions. However, we 
can (equivalently) reformulate the above problem so that the admissible actions $A(x)$ become $A$ 
for all $x \in X$. In this case, instead of the dynamics in equation (7.1) we have 

$$x_{t+1} = \theta_1 \min(a_t,x_t) \exp\{-\theta_2 \min(a_t,x_t) + v_t\}, \quad t = 0,1,2,\ldots$$

and $A(x) = [\kappa_{\min},\kappa_{\max}]$ for all $x \in X$. The one-stage reward function is $u(x_t - a_t)\,1_{\{x_t \ge a_t\}}$. 





Saldi, Yüksel, and Linder: Asymptotic Optimality of Finite Approximations to MDPs 
Mathematics of Operations Research 00(0), pp. 000—000, ©0000 INFORMS 




Since $X$ is already compact, it is sufficient to discretize $[\kappa_{\min},\kappa_{\max}]$. The noise process $\{v_t\}$ is a 
sequence of independent and identically distributed (i.i.d.) random variables which have common 
density $g$ supported on $[0,\lambda]$. Therefore, the transition probability $p(\cdot\,|\,x,a)$ is given by 

$$\begin{aligned}
p(D\,|\,x,a) &= \Pr\{x_{t+1} \in D \mid x_t = x, a_t = a\} \\
&= \Pr\{\theta_1 \min(a,x) \exp\{-\theta_2 \min(a,x) + v\} \in D\} \\
&= \int_D g\big(\log(v) - \log(\theta_1 \min(a,x)) + \theta_2 \min(a,x)\big) \frac{1}{v}\, m(dv),
\end{aligned}$$

for all $D \in \mathcal{B}(\mathbb{R})$. To make the model consistent, we must have $\theta_1 y \exp\{-\theta_2 y + v\} \in [\kappa_{\min},\kappa_{\max}]$ for 
all $(y,v) \in [\kappa_{\min},\kappa_{\max}] \times [0,\lambda]$. 

We assume that (i) $g \ge \varepsilon$ for some $\varepsilon \in \mathbb{R}_+$ on $[0,\lambda]$, (ii) $g$ is continuous on $[0,\lambda]$, and (iii) the 
utility function $u$ is continuous. Define $h(v,x,a) := g\big(\log(v) - \log(\theta_1 \min(a,x)) + \theta_2 \min(a,x)\big)\frac{1}{v}$, 
and for each $(x,a) \in X \times A$, let $S_{x,a}$ denote the support of $h(\cdot,x,a)$. Then, Assumption 2.2 holds 
for this model with $\theta(x,a) = \inf_{v \in S_{x,a}} h(v,x,a)$ (provided that it is measurable), $\zeta = m_\kappa$ (Lebesgue 
measure restricted to $[\kappa_{\min},\kappa_{\max}]$), and for some $\lambda \in (0,1)$. 

For the numerical results, we use the following values of the parameters: 

$$\theta_1 = 1.1, \quad \theta_2 = 0.1, \quad \kappa_{\max} = 7, \quad \kappa_{\min} = 0.005, \quad \lambda = 0.5.$$

We assume that the noise process is distributed uniformly over $[0,0.5]$. Hence, $g = 1$ on $[0,0.5]$ 
and zero otherwise. The utility function $u$ is taken to be the shifted isoelastic utility function (see 
Dufour and Prieto-Rumeau [13, Section 4.1]) 

$$u(z) = 3\big((z + 0.5)^{1/3} - (0.5)^{1/3}\big).$$
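
As a quick sanity check of the consistency requirement stated above, the following Python sketch evaluates $\theta_1 y \exp\{-\theta_2 y + v\}$ on a grid of $(y,v) \in [\kappa_{\min},\kappa_{\max}] \times [0,\lambda]$ for the listed parameter values. The function names and the grid resolution are hypothetical; a grid check of this kind is only indicative and is not a proof.

```python
import numpy as np

THETA1, THETA2 = 1.1, 0.1
K_MIN, K_MAX, LAM = 0.005, 7.0, 0.5   # parameter values quoted in the text


def next_population(x, a, v):
    """Reformulated Ricker dynamics: theta1 * min(a, x) * exp(-theta2 * min(a, x) + v)."""
    m = np.minimum(a, x)
    return THETA1 * m * np.exp(-THETA2 * m + v)


def utility(z):
    """Shifted isoelastic utility u(z) = 3((z + 0.5)^(1/3) - 0.5^(1/3))."""
    return 3.0 * ((z + 0.5) ** (1.0 / 3.0) - 0.5 ** (1.0 / 3.0))


# Grid over the spawning stock y = min(a, x) and the noise v.
y = np.linspace(K_MIN, K_MAX, 2001)
v = np.linspace(0.0, LAM, 201)
Y, V = np.meshgrid(y, v, indexing="ij")
nxt = next_population(Y, Y, V)

print("smallest next state on the grid:", nxt.min())
print("largest next state on the grid: ", nxt.max())
print("stays in [k_min, k_max]:", bool(nxt.min() >= K_MIN and nxt.max() <= K_MAX))
```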

We selected 25 different values for the number $n$ of grid points to discretize the state space: 
$n = 10,20,30,\ldots,250$. The grid points are chosen uniformly over the interval $[\kappa_{\min},\kappa_{\max}]$. We 
also uniformly discretize the action space $A$ by using the following number of grid points: $5n = 
50,100,150,\ldots,1250$. 

We use the relative value iteration algorithm (see Bertsekas [5, Chapter 4.3.1]) to compute the 
value functions of the finite models. For each $n$, the finite state models are constructed as in 
Section 2 by replacing $Z$ with $[\kappa_{\min},\kappa_{\max}]$ and by setting $\nu_n(\cdot) = m_\kappa(\cdot)$. 
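
A minimal Python sketch of such a computation is given below. It is an illustration under simplifying assumptions rather than the exact construction of Section 2: rewards and transition probabilities are evaluated at the cell midpoints instead of being averaged over each cell, the noise is uniform on $[0,\lambda]$ so that $\log x_{t+1}$ is uniformly distributed on an interval of length $\lambda$, and a coarse action grid is used to keep the example fast. All function names and grid sizes are hypothetical.

```python
import numpy as np

THETA1, THETA2, K_MIN, K_MAX, LAM = 1.1, 0.1, 0.005, 7.0, 0.5


def utility(z):
    """Shifted isoelastic utility from the text."""
    return 3.0 * ((z + 0.5) ** (1.0 / 3.0) - 0.5 ** (1.0 / 3.0))


def build_model(n_states, n_actions):
    """Finite model on a uniform grid of [K_MIN, K_MAX] (midpoint simplification)."""
    edges = np.linspace(K_MIN, K_MAX, n_states + 1)
    states = 0.5 * (edges[:-1] + edges[1:])
    actions = np.linspace(K_MIN, K_MAX, n_actions)
    log_edges = np.log(edges)
    P = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions))
    for i, x in enumerate(states):
        for j, a in enumerate(actions):
            m = min(a, x)
            lo = np.log(THETA1 * m) - THETA2 * m       # log x_{t+1} is uniform on [lo, lo + LAM]
            clipped = np.clip(log_edges, lo, lo + LAM)
            P[i, j] = np.diff(clipped) / LAM           # probability of each cell
            R[i, j] = utility(x - a) if x >= a else 0.0
    return states, P, R


def relative_value_iteration(P, R, ref=0, tol=1e-9, max_iter=100_000):
    """Relative value iteration for the average reward (maximization)."""
    h = np.zeros(P.shape[0])
    gain = 0.0
    for _ in range(max_iter):
        Th = (R + P @ h).max(axis=1)
        gain = Th[ref]                 # current estimate of the optimal average reward
        h_new = Th - gain              # normalize so that h_new[ref] = 0
        if np.max(np.abs(h_new - h)) < tol:
            return gain, h_new
        h = h_new
    return gain, h


states, P, R = build_model(n_states=30, n_actions=30)
gain, h = relative_value_iteration(P, R)
print("estimated optimal average reward:", gain)
```

Repeating the computation for increasing `n_states` (and a proportionally finer action grid) yields a sequence of estimates of the kind compared in this example.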

Figure 2 shows the graph of the value functions of the finite models corresponding to the different 
values of $n$ (number of grid points), when the initial state is $x = 2$. It can be seen that the value 
functions converge (to the value function of the original model). 


8. Conclusion. The approximation of a discrete time MDP by finite-state MDPs was considered 
for discounted and average costs for both compact and non-compact state spaces. Under 
usual conditions imposed for studying Markov decision processes, it was shown that if one uses a 
sufficiently large number of grid points to discretize the state space, then the resulting finite-state 
MDP yields a near optimal policy. Under the Lipschitz continuity of the transition probability and 
the one-stage cost function, explicit bounds were derived on the performance loss due to discretization 
in terms of the number of grid points for the compact state case. These results were then 
illustrated numerically by considering two different MDP models. 
Figure 2. Optimal rewards of the finite models when the initial state is x = 2 